Abstract

This paper proposes a novel framework for manifold-valued regression 
and establishes its consistency as well as its contraction rate. It assumes 
a predictor with values in the interval [0,1] and response with values in 
a compact Riemannian manifold M. This setting is useful for applications
such as modeling dynamic scenes or shape deformations, where the
visual scene or the deformed objects can be modeled by a manifold. The 
proposed framework is nonparametric and uses the heat kernel (and its 
associated Brownian motion) on manifolds as an averaging procedure. It 
directly generalizes the use of the Gaussian kernel (as a natural model 
of additive noise) in vector-valued regression problems. In order to avoid 
explicit dependence on estimates of the heat kernel, we follow a Bayesian 
setting, where Brownian motion on M induces a prior distribution on 
the space of continuous functions $C([0,1], M)$. For the case of discretized
Brownian motion, we establish the consistency of the posterior distribution
in terms of the $L_q$ distances for any $1 \le q < \infty$. Most importantly, we
establish a contraction rate of order $O(n^{-1/4+\epsilon})$ for any fixed $\epsilon > 0$, where
n is the number of observations. For the continuous Brownian motion we 
establish weak consistency. 


1 Introduction 

In many applications of regression analysis, the response variables lie in Riemannian
manifolds. For example, in directional statistics [20, 12, 11] the response
variables take values in the sphere or the group of rotations. Applications of directional
statistics include crystallography [22], attitude determination for navigation
and guidance control [30], testing procedure for Gene Ontology cellular

*This work was supported by NSF awards DMS-09-56072 and DMS-14-18386 and the
University of Minnesota Doctoral Dissertation Fellowship Program. 





component categories [27], visual invariance studies [21] and geostatistics [34].
Other modern applications of regression give rise to different types of manifold-valued
responses. In the regression problem of estimating shape deformations of
the brain over time (e.g., for studying brain development, aging or diseases), the 
response variables lie in the space of shapes [10, 24, 17, 3, 25, 9]. In the analysis 
of landmarks [16] the response variables lie in the Lie group of diffeomorphisms. 

The quantitative analysis of regression with manifold-valued responses (which 
we refer to as manifold-valued regression) is still in early stages and is significantly
less developed than statistical analysis of vector-valued regression with
manifold-valued predictors [1, 8, 26, 5, 28, 36, 7]. A main obstacle for advancing
the analysis of manifold-valued regression is that there is no linear
structure in general Riemannian manifolds and thus no direct method for averaging
responses. Parametric methods for regression problems with manifold-valued
responses [10, 17, 21, 13, 16] directly generalize the linear or polynomial
real-valued regressions to geodesic or Riemannian polynomial manifold-valued 
regression. Nevertheless, the geodesic or Riemannian polynomial assumption 
on the underlying function is often too restrictive and for many applications 
non-parametric models are required. To address this issue, Hein [15] and Bhattacharya
[3] proposed kernel-smoothing estimators, where in [15] the predictors
and responses take values in manifolds and in [3] the predictors and responses 
take values in compact metric spaces with special kernels. Hein [15] proved 
convergence of the risk function to a minimal risk (w.p. 1; conditioned on the 
predictor) and Bhattacharya [3] established consistency of the joint density function
of the predictors and the responses. However, the rate of contraction (that
is, the rate at which the posterior distribution contracts to a $\delta$ distribution
with respect to the underlying regression function) of any previously proposed 
manifold-valued regression estimator was not established. To the best of our 
knowledge, rate of contraction was only established when both the predictor 
and response variables are real [32] and this work does not seem to extend to 
manifold-valued regression. 

The main goal of this paper is to establish the rate of contraction of a 
natural estimator for manifold-valued regression (with real-valued predictors). 
This estimator is proposed here for the first time. 

1.1 Setting for Regression with Manifold-Valued Responses 

We assume that the predictor $t$ takes values in $[0,1]$ and the response $x$ takes
values in a compact $D$-dimensional Riemannian manifold $M$. We denote the
Riemannian measure on $M$ by $\mu$ ($d\mu$ is the volume form). We also assume
an underlying function $f_0 \in C([0,1], M)$, which relates the predictor
variables and response variables by determining a density function $p_{f_0(t)}(x)$, so
that

$$x \mid t \sim p_{f_0(t)}(x). \qquad (1)$$

We find it natural to define

$$p_{f_0(t)}(x) = p_{\sigma^2}(f_0(t), x), \qquad (2)$$

where $p_{\sigma^2}(f_0(t), x)$ denotes the heat kernel on $M$ centered at $f_0(t)$ and evaluated
at time $\sigma^2$. Equivalently, $p_{\sigma^2}(f_0(t), x)$ is the transition probability of Brownian
motion on $M$ (with the measure $\mu$) from $f_0(t)$ to $x$ at time $\sigma^2$. We note that $\sigma^2$
controls the variance of the distribution of $x \mid t$ and as $\sigma^2 \to 0$, the distribution
of $x \mid t$ approaches $\delta_{f_0(t)}$. In the special case where $M = \mathbb{R}^D$,

$$p_{\sigma^2}(f_0(t), x) = \frac{1}{(2\pi\sigma^2)^{D/2}} \exp\left( -\frac{\|x - f_0(t)\|^2}{2\sigma^2} \right),$$

and this implies the common model: $x - f_0(t) \mid t \sim N(0, \sigma^2 I)$.
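
To make the role of the time parameter concrete, the following minimal Python sketch evaluates the heat kernel in one case where it has a closed form. It assumes $M$ is the unit circle (our choice for illustration only; the text does not single out this case), where the heat kernel under the variance convention above is the wrapped Gaussian. The function names and the truncation level `n_wraps` are hypothetical.

```python
import math

def circle_heat_kernel(x, y, sigma2, n_wraps=20):
    """Heat kernel on the circle R/(2*pi*Z) at time sigma2 (wrapped Gaussian).

    Assumes the convention in which time sigma2 corresponds to a Gaussian of
    variance sigma2 in the flat case. The sum over windings is truncated to
    |k| <= n_wraps, which is an illustrative choice.
    """
    d = y - x
    norm = 1.0 / math.sqrt(2.0 * math.pi * sigma2)
    return norm * sum(
        math.exp(-((d + 2.0 * math.pi * k) ** 2) / (2.0 * sigma2))
        for k in range(-n_wraps, n_wraps + 1)
    )

def integrate_over_circle(g, n_grid=2000):
    """Midpoint rule for the integral of g over [0, 2*pi)."""
    step = 2.0 * math.pi / n_grid
    return step * sum(g((j + 0.5) * step) for j in range(n_grid))

if __name__ == "__main__":
    center = 1.0
    for sigma2 in (1.0, 0.1, 0.01):
        total = integrate_over_circle(lambda y: circle_heat_kernel(center, y, sigma2))
        near = integrate_over_circle(
            lambda y: circle_heat_kernel(center, y, sigma2) if abs(y - center) < 0.5 else 0.0
        )
        print(f"sigma2={sigma2}: total mass={total:.6f}, mass within 0.5 of center={near:.4f}")
```

In this sketch the total mass stays equal to one while the mass near the center grows as `sigma2` decreases, which is the concentration of $x \mid t$ around $f_0(t)$ described above.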

We also assume a distribution $p(t)$ of $t$, whose support equals $[0,1]$, though its
exact form is irrelevant in the analysis. At last, we assume $n$ i.i.d. observations
$\{(t_i, x_i)\}_{i=1}^n \subset [0,1] \times M$ with the joint distribution $P_0$ and the density function

$$p_0 = \prod_{i=1}^n p(t_i)\, p_{\sigma^2}(f_0(t_i), x_i). \qquad (3)$$

The aim of the regression problem is to estimate $f_0$ among all functions in
$C([0,1], M)$ given the observations $\{(t_i, x_i)\}_{i=1}^n$.

For simplicity, we denote throughout the rest of the paper 


$$\mathcal{P} := C([0,1], M).$$


1.2 Bayesian Perspective: Prior and Posterior Distributions Based on the Brownian Motion

Since the set of functions $\mathcal{P}$ includes Brownian paths, the heat kernel, which
expresses the Brownian transition probability, can be used to form a prior distribution
on $\mathcal{P}$. For the sake of clarity, we need to distinguish between two different
ways of using the heat kernel in this paper. The first one applies the heat kernel
$p_{\sigma^2}(f_0(t), x)$ with $t \in [0,1]$, $f_0 \in \mathcal{P}$ and $x \in M$ (see e.g., Section 1.1), where the
time (or variance) parameter $\sigma^2$ quantifies the "noise" in $x$ w.r.t. the underlying
function $f_0(t)$. The second one uses the heat kernel $p_h(x,y)$ with $h \in \mathbb{R}_+$ and
$x, y \in M$, where the time parameter $h$ inversely characterizes the "smoothness"
of the path between $x$ and $y$. The smaller $h$, the smoother the path between $x$
and $y$ (since smaller $h$ makes it less probable for $y$ to get further away from $x$).
Using the heat kernel $p_h(x,y)$, we define in Section 1.2.1 a continuous Brownian
motion (BM) prior distribution and in Section 1.2.2 a discretized BM prior
distribution. Section 1.2.3 then defines posterior distributions in terms of the
prior distributions and the given observations $\{(t_i, x_i)\}_{i=1}^n \subset [0,1] \times M$ of the
setting.


1.2.1 The Continuous BM Prior on $\mathcal{P}$

We note that a function $f \in \mathcal{P}$ can be identified as a parametrized path in $M$.
Let us assume that $x \in M$ is a starting point of this path, that is $f(0) = x$.
We denote $\mathcal{P}_x := \{f \in \mathcal{P} : f(0) = x\}$. Corollary 2.19 of [2] implies that
there exists a unique probability measure $W_x$ on $\mathcal{P}_x$ such that for any $n \in \mathbb{N}$,
$0 < t_1 < \ldots < t_n = 1$, and open subsets $U_1, \ldots, U_n \subset M$, the following identity
is satisfied

$$W_x(f \in \mathcal{P}_x \mid f(t_1) \in U_1, \ldots, f(t_n) \in U_n) = \int_{U_1 \times \ldots \times U_n} p_{t_n - t_{n-1}}(x_n, x_{n-1}) \cdots p_{t_2 - t_1}(x_2, x_1)\, p_{t_1}(x_1, x)\, d\mu(x_1) \cdots d\mu(x_n). \qquad (4)$$

We define the conditional prior distribution of $f \in \mathcal{P}$ given $x \in M$ by $W_x$. We
assume that the distribution of $f(0) = x$ is $\mu/\mu(M)$ and thus obtain that the
prior distribution $\Pi(f)$ of $f \in \mathcal{P}$ is $W_x \times \mu/\mu(M)$.

1.2.2 The Discretized BM Prior on $\mathcal{P}$

The continuous BM prior often does not have a density function. We discuss
here a special case of discretized BM, where the density function of the prior is
well-defined. For $0 < h < 1$ such that $1/h$ is an integer, we define $PGF(h)$ as the
set of piecewise geodesic functions from $[0,1]$ to $M$, where for each $0 \le k < 1/h$,
$k \in \mathbb{N}$, the interval $[kh, (k+1)h]$ is mapped to the geodesic curve from $f(kh)$
to $f((k+1)h)$. Each function in $PGF(h)$ is determined by its values $f(kh)$.
Let the distribution of $f(0)$ be uniform w.r.t. the Riemannian measure $\mu$, and
let the transition probability from $f(kh)$ to $f((k+1)h)$ be given by the heat
kernel $p_h(f(kh), f((k+1)h))$. Then the density function $\pi_h$ (w.r.t. $\mu$) of the
discretized BM prior on $PGF(h)$ can be specified as follows:

$$\pi_h(f) = \prod_{k=1}^{1/h} p_h(f(kh - h), f(kh)).$$

The corresponding distribution is denoted by $\Pi_h$.

Throughout the paper we assume a sequence $b_n \to 0$ with $0 < b_n < 1$ and
with some abuse of notation denote by $\Pi_n$ the sequence of discretized BM priors
defined above with $h = b_n$. By construction, $\Pi_n$ is supported on $PGF(b_n)$.
Since $PGF(b_n) \subset \mathcal{P}$, $\Pi_n$ can also be considered as a set of priors on $\mathcal{P}$.
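
A draw from the discretized BM prior can be sketched in a few lines when the heat kernel is easy to sample. The Python sketch below assumes $M$ is the unit circle (an illustrative choice, not a case singled out in the text), where a heat-kernel transition at time $h$ is a Gaussian increment of variance $h$ wrapped modulo $2\pi$. The function names are hypothetical.

```python
import math
import random

TWO_PI = 2.0 * math.pi

def sample_discretized_bm_nodes(h, rng):
    """Sample the node values f(0), f(h), ..., f(1) of a discretized BM path.

    Assumes M is the unit circle: f(0) is uniform, and each transition is a
    N(0, h) increment wrapped mod 2*pi (a draw from the circle heat kernel).
    """
    n_steps = round(1.0 / h)
    nodes = [rng.random() * TWO_PI]
    for _ in range(n_steps):
        nodes.append((nodes[-1] + rng.gauss(0.0, math.sqrt(h))) % TWO_PI)
    return nodes

def signed_arc(a, b):
    """Signed shortest rotation taking angle a to angle b, in [-pi, pi)."""
    return (b - a + math.pi) % TWO_PI - math.pi

def eval_piecewise_geodesic(nodes, t):
    """Evaluate at t in [0, 1] the piecewise geodesic determined by its nodes."""
    n_steps = len(nodes) - 1
    k = min(int(t * n_steps), n_steps - 1)   # index of the interval [kh, (k+1)h]
    frac = t * n_steps - k                   # relative position inside that interval
    return (nodes[k] + frac * signed_arc(nodes[k], nodes[k + 1])) % TWO_PI

if __name__ == "__main__":
    rng = random.Random(0)
    nodes = sample_discretized_bm_nodes(0.1, rng)
    print([round(eval_piecewise_geodesic(nodes, t / 20), 3) for t in range(21)])
```

Between consecutive nodes the sketch follows the shortest arc, which plays the role of the geodesic segment in the definition of $PGF(h)$.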


1.2.3 Posterior Distributions 


Given observations $\{(t_i, x_i)\}_{i=1}^n$ drawn according to the setting of Section 1.1,
the posterior distribution of $\Pi$ is given by

$$\Pi(f \in A \mid \{(t_i, x_i)\}_{i=1}^n) \propto \int_A \prod_{i=1}^n p(t_i, x_i \mid f)\, d\Pi(f) = \int_A \prod_{i=1}^n p_{f(t_i)}(x_i)\, p(t_i)\, d\Pi(f), \qquad (5)$$

where the equality in (5) follows by applying (1) and (2) to the estimator $f$ of
$f_0$.
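
As an illustration of how the integrand in (5) can be evaluated for a single candidate path, the following Python sketch computes an unnormalized log-posterior for a piecewise geodesic path. It assumes $M$ is the unit circle (our illustrative choice), drops the factors $p(t_i)$ and the uniform density of $f(0)$ because they do not depend on $f$, and uses hypothetical function names throughout.

```python
import math

TWO_PI = 2.0 * math.pi

def signed_arc(a, b):
    """Signed shortest rotation taking angle a to angle b, in [-pi, pi)."""
    return (b - a + math.pi) % TWO_PI - math.pi

def log_heat_kernel(x, y, time, n_wraps=10):
    """Log of the circle heat kernel (wrapped Gaussian with variance `time`)."""
    d = signed_arc(x, y)
    terms = [-((d + TWO_PI * k) ** 2) / (2.0 * time) for k in range(-n_wraps, n_wraps + 1)]
    top = max(terms)  # log-sum-exp keeps the sum stable for small `time`
    return top + math.log(sum(math.exp(u - top) for u in terms)) - 0.5 * math.log(TWO_PI * time)

def eval_path(nodes, t):
    """Evaluate at t in [0, 1] the piecewise geodesic determined by its nodes."""
    n_steps = len(nodes) - 1
    k = min(int(t * n_steps), n_steps - 1)
    frac = t * n_steps - k
    return (nodes[k] + frac * signed_arc(nodes[k], nodes[k + 1])) % TWO_PI

def log_posterior_unnormalized(nodes, data, sigma2):
    """Log prior density of the nodes plus log-likelihood of data [(t_i, x_i), ...]."""
    h = 1.0 / (len(nodes) - 1)
    log_prior = sum(log_heat_kernel(nodes[k - 1], nodes[k], h) for k in range(1, len(nodes)))
    log_lik = sum(log_heat_kernel(eval_path(nodes, t), x, sigma2) for t, x in data)
    return log_prior + log_lik

if __name__ == "__main__":
    data = [(0.1, 1.0), (0.5, 1.0), (0.9, 1.0)]
    print(log_posterior_unnormalized([1.0] * 5, data, sigma2=0.1))
    print(log_posterior_unnormalized([3.0] * 5, data, sigma2=0.1))
```

A quantity of this form is what a sampler over node values would compare between candidate paths; the sketch does not implement such a sampler.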
1.3 Main Theorems: Posterior Consistency and Rate of Contraction

We establish the posterior consistency for the discretized and continuous BM
priors respectively. That is, we show that as $n$ approaches infinity, the posterior
distributions contract with high probability to the distribution $\delta_{f_0}$ (recall that
$f_0$ is the underlying function in $\mathcal{P}$). Furthermore, for the discretized BM we
study the rate of contraction of the posterior distribution. The theorem for the
discretized BM is formulated in Section 1.3.1 and the one for the continuous
BM (with weaker convergence) in Section 1.3.2.


1.3.1 Posterior Consistency and Rate of Contraction for Discretized 
BM 


Theorem 1.1 below formulates the rate of contraction of the posterior distribution
of the discretized BM with respect to the $L_q$ metric on $\mathcal{P}$, where $1 \le q < \infty$.
This metric, $d_q$, is defined as follows:

$$d_q(f_1, f_2) = \left( \int_{t \in [0,1]} \mathrm{dist}_M(f_1(t), f_2(t))^q\, p(t)\, dt \right)^{1/q}, \qquad (6)$$

where $\mathrm{dist}_M$ denotes the geodesic distance on $M$ and $p(t)$ is the pdf for the
predictor $t$.
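
For intuition about the metric in (6), the following Python sketch approximates $d_q$ and the sup-distance between two paths by a simple grid rule. It assumes $M$ is the unit circle and a uniform $p(t)$ by default; both are our illustrative assumptions, and the function names are hypothetical.

```python
import math

def circle_dist(a, b):
    """Geodesic distance between two angles on the unit circle."""
    d = abs(a - b) % (2.0 * math.pi)
    return min(d, 2.0 * math.pi - d)

def d_q(f1, f2, q, pdf=lambda t: 1.0, n_grid=10_000):
    """Midpoint-rule approximation of (int_0^1 dist(f1(t), f2(t))^q p(t) dt)^(1/q)."""
    step = 1.0 / n_grid
    total = 0.0
    for j in range(n_grid):
        t = (j + 0.5) * step
        total += circle_dist(f1(t), f2(t)) ** q * pdf(t)
    return (total * step) ** (1.0 / q)

def d_inf(f1, f2, n_grid=10_000):
    """Grid approximation of max over t in [0, 1] of dist(f1(t), f2(t))."""
    return max(circle_dist(f1(j / n_grid), f2(j / n_grid)) for j in range(n_grid + 1))

if __name__ == "__main__":
    zero = lambda t: 0.0
    ramp = lambda t: t
    for q in (1, 2, 8):
        print(q, round(d_q(zero, ramp, q), 5))
    print("sup", d_inf(zero, ramp))
```

With a uniform $p(t)$ the values increase with $q$ and stay below the sup-distance, which matches the ordering of these metrics used later in the proof.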


Theorem 1.1. Assume a regression setting with a predictor variable $t \in [0,1]$,
whose pdf $p(t)$ is strictly positive on $[0,1]$, a response variable $x$ in a compact
finite-dimensional Riemannian manifold $M$ and an underlying and unknown
Lipschitz function $f_0 \in \mathcal{P}$, which relates $x$ and $t$ according
to (1) and (2). Assume an arbitrarily fixed $0 < \epsilon < 1/4$ and for $n \in \mathbb{N}$,
let $b_n = n^{-1/2+2\epsilon}$ be the sidelength of the set $PGF(b_n)$ and let $\{\Pi_n\}_{n \in \mathbb{N}}$ denote
the sequence of discretized BM priors on $PGF(b_n)$. Then there exists an
absolute constant $A_0$ and a fixed constant $C_0$ depending only on the positive
minimum value of $p(t)$ on $[0,1]$, the volume of $M$ and the Riemannian metric
of $M$¹, such that $\Pi_n(\cdot \mid \{(t_i, x_i)\}_{i=1}^n)$ contracts to $f_0$ according to the rate
$\epsilon_n = \sqrt{b_n}/C_0 = n^{-1/4+\epsilon}/C_0$. More precisely, for any $1 \le q < \infty$

$$\Pi_n(f : d_q(f, f_0) \ge A_0 \epsilon_n \mid \{(t_i, x_i)\}_{i=1}^n) \to 0$$

in $P_0$-probability (see (3)) as $n \to \infty$.

The proof of Theorem 1.1 appears in Section 2 and utilizes a general strategy 
for establishing contraction according to [14]. The significance of the theorem 
is in properly determining the sidelength parameter $b_n$ (as a function of $n$).
Practical application of the discretized BM prior can suffer from underfitting
or overfitting as a result of too small or too large choice of $b_n$ respectively.
Theorem 1.1 implies that for $n$ observations, $b_n$ should be picked as $n^{-1/2+2\epsilon}$
to achieve a contraction rate of $n^{-1/4+\epsilon}$ for any fixed $\epsilon > 0$.

¹ More precisely, the dependence of the constant $C_{II}$ (which is later defined in (15)) on the
Riemannian metric.
1.3.2 Posterior Consistency for Continuous BM

We show here that the posterior distribution $\Pi(\cdot \mid \{(t_i, x_i)\}_{i=1}^n)$ is weakly consistent.
In order to clearly specify the weak convergence, it is natural to identify
functions in $\mathcal{P}$ with density functions of observations. Let $\mathcal{D}$ denote the set
of densities $p(t,x)$ from which the observations $\{(t_i, x_i)\}_{i=1}^n \subset [0,1] \times M$ are
drawn. Assuming a fixed variance $\sigma^2$, a function $f \in \mathcal{P}$ can be identified with
a density function $p_f \in \mathcal{D}$ as follows:

$$f \mapsto p_f(t,x) := p_{\sigma^2}(f(t), x)\, p(t). \qquad (7)$$

Therefore, $\Pi$ induces a prior on the set $\mathcal{D}$, which is again denoted by $\Pi$ with
some abuse of notation. For the simplicity of analysis, we assume here that $\sigma^2$
is known. Section 4.1 discusses the modification needed when $\sigma^2$ is unknown.
For the underlying function $f_0$, we define its weak neighborhood of radius $\epsilon$
by

$$N_\epsilon(f_0) = \left\{ f \in \mathcal{P} : \left| \int_{[0,1] \times M} p_g\, p_f\, dt\, d\mu(x) - \int_{[0,1] \times M} p_g\, p_{f_0}\, dt\, d\mu(x) \right| < \epsilon, \ \forall g \in \mathcal{P} \right\}.$$

Theorem 1.2 states the weak posterior consistency of the continuous BM prior
$\Pi$. It is proved later in Section 3.


Theorem 1.2. If $M$ is a compact Riemannian manifold and if the true underlying
function $f_0 \in \mathcal{P}$ of the regression model is Lipschitz continuous, then the
posterior distribution $\Pi(\cdot \mid \{(t_i, x_i)\}_{i=1}^n)$ is weakly consistent. In other words, for
any $\epsilon > 0$,

$$\Pi(N_\epsilon(f_0) \mid \{(t_i, x_i)\}_{i=1}^n) \to 1$$

almost surely w.r.t. the true probability measure $P_0$ (defined in (3)) as $n \to \infty$.


1.4 Main Contributions of This Work 

The first contribution of this paper is the proposal of a natural model for 
manifold-valued regression (with real-valued predictors). Indeed, the heat kernel 
on the Riemannian manifold gives rise to an averaging process, which generalizes
basic averages of vector-valued regression. In particular, the heat kernel on
$\mathbb{R}^D$ is the same as the Gaussian kernel (applied to the difference of $f(t)$ and $x$),
which is widely used in regression when $x \in \mathbb{R}^D$ (due to an additive Gaussian
noise model). The Bayesian setting is natural for the proposed model, since it
uses the discretized or continuous Brownian motion on $M$ as a prior distribution
of $f$ and it does not directly use the heat kernel. It is not hard to simulate the
Brownian motion, but tight estimates of the heat kernel for general M are hard. 

The second and main contribution of this work is the derivation of the contraction
rate of the posterior distribution for the discretized Brownian motion.
To the best of our knowledge the rate of contraction was only established before
for regression with real-valued predictors and responses. For this case, van
Zanten [32] established a contraction rate for the posterior distribution of
$n$ samples under the $L_p$-norm, where $1 \le p < \infty$. His analysis does not seem
to extend to our setting. It is unclear to us if this stronger contraction rate
also applies to the general case of manifold-valued regression (see discussion in
Section 6.3).

The third contribution is the consistency result for the continuous Brownian 
motion. The only other consistency result for manifold-valued regression we are 
aware of is by Bhattacharya [3]. It suggests a general nonparametric Bayesian 
kernel-based framework for modeling the conditional distribution x\t, where the 
predictor t and response x take values in metric spaces with kernels. Under a 
suitable assumption on the kernels, [3] established the posterior consistency for 
the conditional distribution w.r.t. the $L_1$ norm (see [3, Proposition 13.1]). We
remark that [3] applies to responses and predictors in Riemannian manifolds 
(where the corresponding metric kernels are the heat kernels). However, both 
the conditional distribution (of x given t) and the prior distribution are different 
than the ones proposed here. It is unclear how to obtain a rate of contraction 
for [3]. 

The last contribution is the implication of a new numerical procedure for 
manifold-valued regression, which is based on simulating a Brownian motion on 
M. The flexibility of the shapes of the sample paths of the Brownian motion is 
advantageous over state-of-the-art geodesic regression methods. Real applications
often do not give rise to geodesics and thus the nonparametric regression
method is less likely to suffer from underfitting. Another nonparametric approach
is kernel regression [15, 3]. In Section 5, we compare kernel
regression and Brownian motion regression (our method) for a particular example,
which is easy to visualize.

1.5 Organization of the Rest of the Paper 

The paper is organized as follows. Theorems 1.1 and 1.2 are proved in Sections 2 
and 3 respectively. Section 4 extends the framework to the cases where $\sigma^2$ is
unknown and $p(t)$ is supported on a subset of $[0,1]$. Section 5 demonstrates the
performance of the proposed procedure on a particular example, which is easy 
to visualize, and compares it to kernel regression [15, 3]. 


2 Proof of Theorem 1.1 

Our proof utilizes Theorem 2.1 of [14, page 4]. The latter theorem establishes
the contraction rate for a sequence of priors $\Pi_n$ over the set $\mathcal{D}$ of joint densities
of the predictor $t$ and response $x$ under some conditions on $\Pi_n$ and the covering
number of $\mathcal{D}$. We thus conclude Theorem 1.1 by establishing these conditions.

We use the following distance $d_{q,\mathcal{D}}$ on the space $\mathcal{D}$ with an arbitrarily fixed
$1 \le q < \infty$:

$$d_{q,\mathcal{D}}(p_1, p_2) = \|p_1 - p_2\|_q \quad \text{for } p_1, p_2 \in \mathcal{D}.$$

The regression framework is formulated in terms of the space $\mathcal{P}$ (see Section 1.1,
in particular, the mapping of $\mathcal{P}$ to $\mathcal{D}$ in (7)) and the metric $d_q$ on $\mathcal{P}$ (see (6)).
We also use the $d_\infty$ metric on $\mathcal{P}$, which is defined by

$$d_\infty(f_1, f_2) = \max_{t \in [0,1]} \mathrm{dist}_M(f_1(t), f_2(t)). \qquad (8)$$

The proof is organized as follows. Section 2.1 shows that under the mapping
(7) of $\mathcal{P}$ to $\mathcal{D}$, $d_{q,\mathcal{D}}$ is bounded from below by $d_q$ (and above by $d_\infty$).
Therefore, the posterior contraction w.r.t. $d_{q,\mathcal{D}}$ implies the posterior contraction
w.r.t. $d_q$. Then, Sections 2.2-2.4 show that if the sidelengths $\{b_n\}_{n \in \mathbb{N}}$ and
a constant $\alpha > 0$ are chosen properly, then the priors $\{\Pi_n\}_{n \in \mathbb{N}}$ and the sieve of
functions $\{\mathcal{P}_{n,\alpha}\}_{n \in \mathbb{N}}$ (defined later in (21)) satisfy conditions (2.2)-(2.4) respectively
in Theorem 2.1 of [14]. The posterior contraction of $\Pi_n$ is then concluded.


2.1 Relations between $d_{q,\mathcal{D}}$, $d_q$ and $d_\infty$

We formulate and prove the following lemma, which relates $d_{q,\mathcal{D}}$, $d_q$
and $d_\infty$. It is later used as follows: The first inequality of (9) deduces $L_q$
convergence in $\mathcal{P}$ from $L_q$ convergence in $\mathcal{D}$. The second inequality of (9) is
used in finding the covering number of the space $\mathcal{D}$.

Lemma 2.1. If $0 < m_p, M_p \in \mathbb{R}$ and $m_p \le p(t) \le M_p$ for all $t \in [0,1]$,
then there exist two constants $C_0, C_1 > 0$ depending only on $m_p$, $M_p$ and
the Riemannian manifold $M$ such that for any $f_1, f_2 \in \mathcal{P}$ with corresponding
densities $p_{f_1}, p_{f_2}$ in $\mathcal{D}$ (via (7))

$$C_0\, d_q(f_1, f_2) \le d_{q,\mathcal{D}}(p_{f_1}, p_{f_2}) \le C_1\, d_\infty(f_1, f_2). \qquad (9)$$


Proof. For $x_1 \ne x_2$, we define the function

$$F(x_1, x_2, y) = \frac{|p_{\sigma^2}(x_1, y) - p_{\sigma^2}(x_2, y)|}{\mathrm{dist}_M(x_1, x_2)}. \qquad (10)$$

We note that the first inequality of (9) is true if there exists a constant $C_0 > 0$
such that

$$\int_{y \in M} F(x_1, x_2, y)^q\, d\mu(y) \ge \frac{C_0^q}{m_p^{q-1}}, \quad \forall x_1 \ne x_2 \in M. \qquad (11)$$


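Since $M$ is compact and $p_{\sigma^2}(x, y)$ is infinitely differentiable, for any $\epsilon > 0$,
there exists $\delta > 0$ such that

$$\left| F(x_1, x_2, y) - \left| \frac{\partial p_{\sigma^2}(x_1, y)}{\partial v_{12}} \right| \right| < \epsilon, \quad \forall x_1, x_2, y \in M,\ \mathrm{dist}_M(x_1, x_2) < \delta, \qquad (12)$$

where $v_{12} \in T_{x_1}M$ is the unit vector of the geodesic connecting $x_1$ and $x_2$.

Since the heat kernel $p_{\sigma^2}(x_1, y)$ is not constant and due to the compactness of
the space of unit tangent vectors, there exists $C_0' > 0$ such that

$$\int_{y \in M} \left| \frac{\partial p_{\sigma^2}(x_1, y)}{\partial v_{12}} \right| d\mu(y) > C_0', \quad \forall x_1 \in M,\ v_{12} \in T_{x_1}M,\ \|v_{12}\| = 1. \qquad (13)$$

Inequalities (12), (13) and the Schwarz inequality imply that

$$\int_{y \in M} F(x_1, x_2, y)^q\, d\mu(y) \ge C_I := \mu(M)^{1-q} \left( C_0' - \epsilon\, \mu(M) \right)^q, \quad \forall x_1, x_2 \in M,\ \mathrm{dist}_M(x_1, x_2) < \delta. \qquad (14)$$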

If we pick $\epsilon$ small enough (with its $\delta$ in (12)), $C_I$ is a positive number. On the
other hand, if the pair $(x_1, x_2)$ satisfies that $\mathrm{dist}_M(x_1, x_2) \ge \delta$, we show that
for some constant $C_{II} > 0$,

$$\int_{y \in M} F(x_1, x_2, y)^q\, d\mu(y) > C_{II}, \quad \forall x_1, x_2 \in M,\ \mathrm{dist}_M(x_1, x_2) \ge \delta. \qquad (15)$$

Since the set $\{(x_1, x_2) \mid \mathrm{dist}_M(x_1, x_2) \ge \delta\}$ is compact, the existence of $C_{II}$ is
guaranteed if we can show that

$$\int_{y \in M} F(x_1, x_2, y)^q\, d\mu(y) > 0, \quad \forall x_1, x_2 \in M,\ \mathrm{dist}_M(x_1, x_2) \ge \delta,$$

which can further be reduced to showing that given any pair $(x_1, x_2) \in M^2$ with $x_1 \ne x_2$,

$$\exists y \in M, \quad p_{\sigma^2}(x_1, y) \ne p_{\sigma^2}(x_2, y). \qquad (16)$$

We prove (16) by contradiction. If (16) is not true, then

$$p_{\sigma^2}(x_1, y) = p_{\sigma^2}(x_2, y), \quad \forall y \in M. \qquad (17)$$

If we plug $y = x_1$ and $y = x_2$ respectively in (17), and use the symmetry of the
heat kernel, we get $p_{\sigma^2}(x_1, x_1) = p_{\sigma^2}(x_1, x_2) = p_{\sigma^2}(x_2, x_2)$, which means that

$$p_{\sigma^2}(x_1, x_2) = \sqrt{p_{\sigma^2}(x_1, x_1)\, p_{\sigma^2}(x_2, x_2)}. \qquad (18)$$

On the other hand,

$$p_{\sigma^2}(x_1, x_2) = \int_{z \in M} p_{\sigma^2/2}(x_1, z)\, p_{\sigma^2/2}(z, x_2)\, d\mu(z) \le \sqrt{\int_{z \in M} p_{\sigma^2/2}(x_1, z)^2\, d\mu(z)}\ \sqrt{\int_{z \in M} p_{\sigma^2/2}(z, x_2)^2\, d\mu(z)} = \sqrt{p_{\sigma^2}(x_1, x_1)\, p_{\sigma^2}(x_2, x_2)}. \qquad (19)$$

In view of (18) the Cauchy-Schwarz inequality used in (19) is an equality and
consequently

$$p_{\sigma^2/2}(x_1, z) = p_{\sigma^2/2}(x_2, z), \quad \forall z \in M.$$

Applying the same argument iteratively, we conclude that for any $m > 0$,

$$p_{\sigma^2/2^m}(x_1, z) = p_{\sigma^2/2^m}(x_2, z), \quad \forall z \in M.$$

However, as $m \to \infty$, $p_{\sigma^2/2^m}(x_1, z) \to \delta_{x_1}$ but $p_{\sigma^2/2^m}(x_2, z) \to \delta_{x_2} \ne \delta_{x_1}$.
This is a contradiction. Inequality (16) and thus (15) are proved. We conclude
from (14) and (15), the first inequality of (9) with $C_0 = (\min(C_I, C_{II})\, m_p^{q-1})^{1/q}$.





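Next, we establish the second inequality of (9). Theorem 4.1.4 in [18, page
105] states that $p_{\sigma^2}(x, y)$ is infinitely differentiable in both variables $x$ and $y$.
In particular, its first partial derivatives are continuous. Furthermore, the fact
that $M$ is compact implies that the first partial derivatives are bounded. That
is, there exists $C_M > 0$ such that

$$\left\| \frac{\partial p_{\sigma^2}(x, y)}{\partial x} \right\| < C_M, \qquad \left\| \frac{\partial p_{\sigma^2}(x, y)}{\partial y} \right\| < C_M.$$

Consequently,

$$|p_{\sigma^2}(x_1, y) - p_{\sigma^2}(x_2, y)| \le C_M\, \mathrm{dist}_M(x_1, x_2). \qquad (20)$$

Applying (20) and then bounding $p(t)$ by $M_p$ and $\mathrm{dist}_M$ by $d_\infty$, we conclude the
second inequality of (9) with $C_1 = C_M M_p^{(q-1)/q} \mu(M)^{1/q}$ as follows:

$$d_{q,\mathcal{D}}(p_{f_1}, p_{f_2}) = \left( \iint |p_{\sigma^2}(f_1(t), y)\, p(t) - p_{\sigma^2}(f_2(t), y)\, p(t)|^q\, d\mu(y)\, dt \right)^{1/q}$$

$$\le \left( \iint C_M^q\, \mathrm{dist}_M(f_1(t), f_2(t))^q\, p(t)^q\, d\mu(y)\, dt \right)^{1/q}$$

$$\le C_M M_p^{(q-1)/q} \mu(M)^{1/q}\, d_\infty(f_1, f_2).$$

□

Remark 2.2. We note that when $q = 1$, the constants $C_0$, $C_1$ in Lemma 2.1 are
independent of $p(t)$. In particular, in this case the condition $m_p \le p(t) \le M_p$ is
not needed.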


2.2 Verification of Inequality 2.2 of [14] 

We estimate the covering numbers of special subsets of $\mathcal{P}$ and $\mathcal{D}$. The final estimate
verifies inequality 2.2 of [14]. We start with some notation and definitions
that also include these special subsets of $\mathcal{P}$ and $\mathcal{D}$.

For $0 < \alpha < 1$ and $f \in \mathcal{P}$, let

$$\|f\|_\alpha := \max_{t_1, t_2 \in [0,1]} \frac{\mathrm{dist}_M(f(t_1), f(t_2))}{|t_1 - t_2|^\alpha}$$

and

$$\mathcal{P}_\alpha := \{f \in \mathcal{P} \mid \|f\|_\alpha < \infty\}.$$

For a sequence $\{M_n\}_{n \in \mathbb{N}}$ increasing to infinity we define the sieve of functions

$$\mathcal{P}_{n,\alpha} = \{f \in \mathcal{P}_\alpha \mid \|f\|_\alpha \le M_n\}. \qquad (21)$$

This induces a sieve of densities $\mathcal{D}_{n,\alpha}$ of $\mathcal{D}_\alpha$ by the map (7). For $\epsilon > 0$ and a
metric space $\mathcal{E}$ with the metric $d$, we denote by $N(\epsilon, \mathcal{E}, d)$ the $\epsilon$-covering number
of $\mathcal{E}$, which is the minimal number of balls of radius $\epsilon$ needed to cover $\mathcal{E}$.
In the rest of the section we estimate the covering numbers of the sets $M$,
$\mathcal{P}_{n,\alpha}$ and $\mathcal{D}_{n,\alpha}$. We assume a decreasing sequence $\epsilon_n$ approaching zero.
Section 2.2.1 upper bounds $N(\epsilon_n, M, \mathrm{dist}_M)$ for an arbitrary such sequence $\epsilon_n$.
Section 2.2.2 upper bounds $N(\epsilon_n, \mathcal{P}_{n,\alpha}, d_\infty)$ for arbitrary sequences $\epsilon_n$ and
$M_n$ as above. At last, Section 2.2.3 upper bounds $N(\epsilon_n, \mathcal{D}_{n,\alpha}, d_{q,\mathcal{D}})$ for sequences
$\epsilon_n$ and $M_n$ satisfying an additional condition (see (37) below). It verifies
inequality 2.2 of [14].


2.2.1 Covering Numbers of M 

For any $\epsilon_n > 0$, we construct an $\epsilon_n$-net on the $D$-dimensional compact Riemannian
manifold $M$. Let $D(M)$ be the diameter of $M$. That is,

$$D(M) = \max_{x, y \in M} \mathrm{dist}_M(x, y).$$

The Nash embedding theorem [23] and Whitney embedding theorem [35] imply
that there exists an isometric map

$$E : M \to \mathbb{R}^{2D}.$$

Since $D(E(M)) \le D(M)$, the image $E(M)$ is contained in a hypercube $HC$
with side length $2D(M)$. We partition this $HC$ as a regular grid with grid
spacing $\epsilon_n/\sqrt{2D}$ in each direction. Since each point in $HC$ has distance less
than $\epsilon_n$ to some grid vertex, the set of grid vertices, $GV(\epsilon_n)$, is an $\epsilon_n$-net of
$HC$. Thus the $\epsilon_n$-covering number of $HC$ can be bounded as follows:

$$N(\epsilon_n, HC, \mathrm{dist}_{\mathbb{R}^{2D}}) \le \left( \frac{2D(M)}{\epsilon_n/\sqrt{2D}} \right)^{2D}. \qquad (22)$$

Next, we construct an $\epsilon_n$-net of $M$ using the $\epsilon_n/3$-net $GV(\epsilon_n/3)$ of $HC$.
To begin with, we show in Lemma 2.3 that the Riemannian distance and the
Euclidean distance are equivalent locally under an isometric embedding.
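
The grid argument behind (22) can be checked numerically. The Python sketch below, with hypothetical function names, evaluates the bound as we read it from (22) and verifies on random points that a regular grid with spacing $\epsilon/\sqrt{m}$ in $\mathbb{R}^m$ leaves every point within $\epsilon/2$ of some vertex. It assumes a grid aligned with the origin, which is our simplification.

```python
import math
import random

def hypercube_cover_bound(diameter, eps, dim):
    """Bound of the form (2*D(M) / (eps / sqrt(2*dim)))^(2*dim), as read from (22)."""
    ambient = 2 * dim  # dimension of the ambient Euclidean space
    return (2.0 * diameter / (eps / math.sqrt(ambient))) ** ambient

def nearest_vertex_distance(point, spacing):
    """Euclidean distance from `point` to the closest vertex of the grid spacing * Z^m."""
    return math.sqrt(sum((c - spacing * round(c / spacing)) ** 2 for c in point))

if __name__ == "__main__":
    rng = random.Random(0)
    eps, ambient = 0.3, 4
    spacing = eps / math.sqrt(ambient)
    worst = max(
        nearest_vertex_distance([rng.uniform(-2.0, 2.0) for _ in range(ambient)], spacing)
        for _ in range(10_000)
    )
    print("largest observed distance:", worst, "compared with eps/2 =", eps / 2)
    print("bound for D(M)=pi, eps=0.3, D=2:", hypercube_cover_bound(math.pi, eps, 2))
```

Each coordinate is at most half a grid spacing from the nearest vertex, so the distance is at most $(\epsilon/\sqrt{m})\sqrt{m}/2 = \epsilon/2$, consistent with the claim that the vertices form an $\epsilon_n$-net.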


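Lemma 2.3. Let $M$ be a compact Riemannian manifold and $E$ be an isometric
embedding to $\mathbb{R}^{2D}$. Then for any fixed constant $C > 0$, there exists a constant
$\delta_C > 0$ such that $\forall x, y \in M$ with $\mathrm{dist}_{\mathbb{R}^{2D}}(E(x), E(y)) < \delta_C$,

$$|\mathrm{dist}_M(x, y) - \mathrm{dist}_{\mathbb{R}^{2D}}(E(x), E(y))| \le C\, \mathrm{dist}_{\mathbb{R}^{2D}}(E(x), E(y)).$$

Proof. Suppose this is not true. Then there exists a sequence of $(x_n, y_n) \in M^2$
such that $\mathrm{dist}_{\mathbb{R}^{2D}}(E(x_n), E(y_n)) \to 0$ and

$$|\mathrm{dist}_M(x_n, y_n) - \mathrm{dist}_{\mathbb{R}^{2D}}(E(x_n), E(y_n))| > C\, \mathrm{dist}_{\mathbb{R}^{2D}}(E(x_n), E(y_n)). \qquad (23)$$

Since $M$ is compact, there is a subsequence, denoted again by $(x_n, y_n)$, and
a point $z \in M$ such that $x_n, y_n \to z$. By picking an orthonormal basis of
the tangent space $T_zM$ and using the exponential map $\exp_z$, one has normal
coordinates

$$\Phi : \mathbb{R}^D = T_zM \supset B_z(0, r) \to M$$

where $B_z(0, r)$ is the $r$-ball centered at the origin on $T_zM$. Let $\log_z = \exp_z^{-1}$
be the logarithm map at $z$ and $\mathrm{dist}_I$ be the Euclidean distance on $T_zM$. Let
$\mathbf{x}_n = \log_z(x_n)$ and $\mathbf{y}_n = \log_z(y_n)$. Applying Lemma 12 in [33, page 24] for
$\mathbf{x}_n$, $\mathbf{y}_n$,

$$|\mathrm{dist}_M(x_n, y_n) - \mathrm{dist}_I(\mathbf{x}_n, \mathbf{y}_n)| \le O(\max\{\|\mathbf{x}_n\|_2^2, \|\mathbf{y}_n\|_2^2\})\, \mathrm{dist}_I(\mathbf{x}_n, \mathbf{y}_n). \qquad (24)$$

Let $f$ be the composition of $\Phi$ with $E$,

$$f : \mathbb{R}^D \supset B_z(0, r) \to \mathbb{R}^{2D}.$$

We note that $f(\mathbf{x}_n) = E(x_n)$ and $f(\mathbf{y}_n) = E(y_n)$. The Taylor series of $f$ is

$$f(\mathbf{y}_n) - f(\mathbf{x}_n) = (\nabla f(\mathbf{x}_n))^T(\mathbf{y}_n - \mathbf{x}_n) + \frac{1}{2}(\mathbf{y}_n - \mathbf{x}_n)^T(\nabla^2 f(\mathbf{x}_n))(\mathbf{y}_n - \mathbf{x}_n) + \cdots.$$

This implies that

$$\|f(\mathbf{y}_n) - f(\mathbf{x}_n)\|_2 = \|(\nabla f(\mathbf{x}_n))^T(\mathbf{y}_n - \mathbf{x}_n)\|_2 + O(\|\mathbf{y}_n - \mathbf{x}_n\|_2^2). \qquad (25)$$

On the one hand, since $E$ is an isometric embedding, the linear map

$$\nabla f(0) : \mathbb{R}^D \to \mathbb{R}^{2D}$$

preserves the Euclidean distance. On the other hand, the smoothness of $f$
implies that $\nabla f(\mathbf{x})$ has bounded derivatives. Thus,

$$\nabla f(\mathbf{x}_n) = \nabla f(0) + O(\|\mathbf{x}_n\|_2). \qquad (26)$$

Then, (25) and (26) and the triangle inequality imply that

$$\|f(\mathbf{y}_n) - f(\mathbf{x}_n)\|_2 = \|\mathbf{y}_n - \mathbf{x}_n\|_2 + O(\|\mathbf{x}_n\|_2 \|\mathbf{y}_n - \mathbf{x}_n\|_2 + \|\mathbf{y}_n - \mathbf{x}_n\|_2^2).$$

In other words,

$$|\mathrm{dist}_I(\mathbf{x}_n, \mathbf{y}_n) - \mathrm{dist}_{\mathbb{R}^{2D}}(E(x_n), E(y_n))| \le O(\|\mathbf{x}_n\|_2 + \|\mathbf{y}_n - \mathbf{x}_n\|_2)\, \mathrm{dist}_I(\mathbf{x}_n, \mathbf{y}_n). \qquad (27)$$

By (24) and (27),

$$|\mathrm{dist}_M(x_n, y_n) - \mathrm{dist}_{\mathbb{R}^{2D}}(E(x_n), E(y_n))| \le c_n\, \mathrm{dist}_I(\mathbf{x}_n, \mathbf{y}_n),$$

where $c_n = O(\|\mathbf{x}_n\|_2 + \|\mathbf{y}_n - \mathbf{x}_n\|_2 + \max\{\|\mathbf{x}_n\|_2^2, \|\mathbf{y}_n\|_2^2\})$. Moreover, by (27),

$$\mathrm{dist}_I(\mathbf{x}_n, \mathbf{y}_n) \le (1 - O(\|\mathbf{x}_n\|_2 + \|\mathbf{y}_n - \mathbf{x}_n\|_2))^{-1}\, \mathrm{dist}_{\mathbb{R}^{2D}}(E(x_n), E(y_n)).$$

Therefore, if $c_n' = c_n/(1 - O(\|\mathbf{x}_n\|_2 + \|\mathbf{y}_n - \mathbf{x}_n\|_2))$, then

$$|\mathrm{dist}_M(x_n, y_n) - \mathrm{dist}_{\mathbb{R}^{2D}}(E(x_n), E(y_n))| \le c_n'\, \mathrm{dist}_{\mathbb{R}^{2D}}(E(x_n), E(y_n)).$$

We note that $c_n' \to 0$ as $n \to \infty$ since $\mathbf{x}_n, \mathbf{y}_n \to 0$ and this contradicts assumption
(23).

□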
Now, we construct an $\epsilon_n$-net of $M$ from $GV(\epsilon_n/3)$.

Lemma 2.4. Let $GV'(\epsilon_n/3) = \{x \in GV(\epsilon_n/3) \mid \mathrm{dist}_{\mathbb{R}^{2D}}(x, E(M)) \le \epsilon_n/3\}$. There
exists a constant $\delta > 0$ such that $E^{-1}(\mathrm{Proj}_{E(M)}(GV'(\epsilon_n/3)))$ is an $\epsilon_n$-net of $M$
when $\epsilon_n < \delta$. Consequently,

$$N(\epsilon_n, M, \mathrm{dist}_M) \le N(\epsilon_n/3, HC, \mathrm{dist}_{\mathbb{R}^{2D}}).$$

Proof. Suppose $\epsilon_n < \delta := \delta_{1/3}$, where $\delta_{1/3}$ is the constant $\delta_C$ in Lemma 2.3
with $C = 1/3$. For any point $x \in M$, let $y$ be the vertex in $GV(\epsilon_n/3)$ that
is closest to $E(x)$ w.r.t. $\mathrm{dist}_{\mathbb{R}^{2D}}$. Then, by definition, $y \in GV'(\epsilon_n/3)$. Let
$z = E^{-1}(\mathrm{Proj}_{E(M)}(y))$. To prove the lemma, it is sufficient to show that

$$\mathrm{dist}_M(x, z) \le \epsilon_n. \qquad (28)$$

Since $\mathrm{dist}_{\mathbb{R}^{2D}}(E(x), E(z)) \le 2\epsilon_n/3 < \delta_{1/3}$, Lemma 2.3 states that

$$|\mathrm{dist}_M(x, z) - \mathrm{dist}_{\mathbb{R}^{2D}}(E(x), E(z))| \le \frac{1}{3}\, \mathrm{dist}_{\mathbb{R}^{2D}}(E(x), E(z)). \qquad (29)$$

Inequality (29) implies (28) and thus the lemma as follows:

$$\mathrm{dist}_M(x, z) \le \frac{4}{3}\, \mathrm{dist}_{\mathbb{R}^{2D}}(E(x), E(z)) \le 8\epsilon_n/9.$$

□

From now on, we fix an $\epsilon_n/3$-net $S_n$ of $M$, generated as above from the projection
of regular grid vertices $GV(\epsilon_n/9)$ of $HC$ with grid spacing $\epsilon_n/(9\sqrt{2D})$.
Lemma 2.5 provides an upper bound of the number of points in $S_n$ in the $\epsilon_n$-neighborhood
of $x \in S_n$.

Lemma 2.5. For $x \in S_n$ and $X := \{y \in S_n \mid \mathrm{dist}_M(x, y) \le \epsilon_n\}$,

$$\#X \le 21^{2D}.$$

Proof. If $y \in X \subset S_n$, then there is a point $z \in GV'(\epsilon_n/9)$ such that

$$E^{-1}(\mathrm{Proj}_{E(M)}(z)) = y \quad \text{and} \quad \mathrm{dist}_{\mathbb{R}^{2D}}(z, E(y)) \le \epsilon_n/9. \qquad (30)$$

We note that

$$\mathrm{dist}_{\mathbb{R}^{2D}}(E(x), E(y)) \le \mathrm{dist}_M(x, y) \qquad (31)$$

since $E$ is an isometric embedding. Inequalities (30) and (31) and the triangle
inequality imply that $\mathrm{dist}_{\mathbb{R}^{2D}}(E(x), z) \le 10\epsilon_n/9$. Thus, if

$$Y := \{z \in GV(\epsilon_n/9) \mid \mathrm{dist}_{\mathbb{R}^{2D}}(E(x), z) \le 10\epsilon_n/9\},$$

then $X \subset E^{-1}(\mathrm{Proj}_{E(M)}(Y))$. Since the grid spacing is $\epsilon_n/9$, $\#X \le \#Y \le 21^{2D}$.

□
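2.2.2 Covering Numbers of $\mathcal{P}_{n,\alpha}$

Recall that $PGF(a) \subset \mathcal{P}_\alpha$ is the set of piecewise geodesic functions which map
each interval $[ka, (k+1)a]$ to a geodesic on $M$ for $0 \le k < 1/a$. We define

$$F_{S_n}(a) = \{f \in PGF(a) \mid f(ka) \in S_n \text{ for } 0 \le k \le 1/a\},$$

where $S_n$ was defined just before Lemma 2.5. The following Lemma upper
bounds $N(\epsilon_n, \mathcal{P}_{n,\alpha}, d_\infty)$. It uses the constant $\delta_{1/9}$ which was defined in Lemma 2.3
(here $C = 1/9$).

Lemma 2.6. If $M$ is a $D$-dimensional compact Riemannian manifold with
diameter $D(M)$, two sequences $\{\epsilon_n\}_{n \in \mathbb{N}}$ and $\{M_n\}_{n \in \mathbb{N}}$ such that $M_n \to \infty$ and
$\epsilon_n < \delta_{1/9}$, and $a = \frac{\epsilon_n}{3M_n}$, then there is a subset of $F_{S_n}(a)$, which forms an $\epsilon_n$-net of
$\mathcal{P}_{n,\alpha}$ and

$$N(\epsilon_n, \mathcal{P}_{n,\alpha}, d_\infty) \le (21^{2D})^{3M_n/\epsilon_n} \left( \frac{18\sqrt{2D}\, D(M)}{\epsilon_n} \right)^{2D}. \qquad (32)$$

Proof. Given $f \in \mathcal{P}_{n,\alpha}$, an approximation $\hat{f} \in F_{S_n}(a)$ is determined uniquely
by specifying its boundary value $\hat{f}(ka)$ for $0 \le k \le 1/a$, which is given by

$$\hat{f}(ka) = \arg\min_{x \in S_n} \mathrm{dist}_M(x, f(ka)).$$

To show that $d_\infty(\hat{f}, f) \le \epsilon_n$, we check the inequality $\mathrm{dist}_M(\hat{f}(t), f(t)) \le \epsilon_n$ for
all $t \in [0,1]$. Suppose $t \in [ka, ka + a]$. Since $\|f\|_\alpha \le M_n$,

$$\mathrm{dist}_M(f(ka), f(t)) \le M_n a \le \epsilon_n/3. \qquad (33)$$

Moreover, because $\hat{f}$ is a mapping to a geodesic on $[ka, ka + a]$ and the fact that
$S_n$ is an $\epsilon_n/3$-net of $M$,

$$\mathrm{dist}_M(\hat{f}(ka), f(ka)) \le \epsilon_n/3 \qquad (34)$$

and

$$\mathrm{dist}_M(f(ka), \hat{f}(t)) \le \mathrm{dist}_M(f(ka), \hat{f}(ka + a)) \qquad (35)$$

$$\le \mathrm{dist}_M(f(ka), f(ka + a)) + \mathrm{dist}_M(f(ka + a), \hat{f}(ka + a))$$

$$\le 2\epsilon_n/3.$$

It follows from (33), (35) and the triangle inequality that

$$\mathrm{dist}_M(\hat{f}(t), f(t)) \le \epsilon_n. \qquad (36)$$

Define a subset of $F_{S_n}(a)$:

$$SF_{S_n}(a) = \{f \in F_{S_n}(a) \mid \mathrm{dist}_M(f(ka), f(ka + a)) \le \epsilon_n, \ \forall\, 0 \le k < 1/a\}.$$

By the definitions of $\hat{f}$ and $S_n$ and (33), we conclude that $\hat{f} \in SF_{S_n}(a)$. Thus,
$SF_{S_n}(a)$ is an $\epsilon_n$-net of $\mathcal{P}_{n,\alpha}$.
By definition, $N(\epsilon_n, \mathcal{P}_{n,\alpha}, d_\infty) \le \#SF_{S_n}(a)$. It is thus sufficient to estimate
$\#SF_{S_n}(a)$. Lemma 2.4 and (22) imply that

$$\#S_n \le \left( \frac{18 D(M)}{\epsilon_n/\sqrt{2D}} \right)^{2D},$$

which is the upper bound of the number of values that $\hat{f}(0)$ can take. Given
the value of $\hat{f}(ka)$, there are $21^{2D}$ choices for $\hat{f}(ka + a)$ by Lemma 2.5. Thus,
for $a = \epsilon_n/(3M_n)$, (32) is concluded as follows

$$\#SF_{S_n}(a) \le (21^{2D})^{1/a}\, \#S_n \le (21^{2D})^{3M_n/\epsilon_n} \left( \frac{18\sqrt{2D}\, D(M)}{\epsilon_n} \right)^{2D}.$$

□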


2.2.3 Covering Numbers of 

In this section, we prove the following lemma. 
Lemma 2.7. //M„ satisfies 

M < __ 

” - 6C'ii^(log(21) + 1) ’ 


(37) 


where Ci was defined in Lemma 2.1, then for n sufficiently large Vn^a satisfies 
the inequality 2.2 of [If, Theorem 2.1], that is. 


\o^ N{eji,T)ji Q,, dqyo^ fii 


(38) 


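Proof. Recall that $\mathcal{D}_{n,\alpha} = \Phi(\mathcal{P}_{n,\alpha})$ and $d_{q,\mathcal{D}}(p_{f_1}, p_{f_2}) \le C_1 d_\infty(f_1, f_2)$ (see Lemma 2.1). A consequence of this is that an $\varepsilon_n$-net of $\mathcal{D}_{n,\alpha}$ can be induced from an $\varepsilon_n/C_1$-net of $\mathcal{P}_{n,\alpha}$. Therefore,

\[ N(\varepsilon_n, \mathcal{D}_{n,\alpha}, d_{q,\mathcal{D}}) \le N(\varepsilon_n/C_1, \mathcal{P}_{n,\alpha}, d_\infty) \le \left(21^{2D}\right)^{3C_1M_n/\varepsilon_n} \left( \frac{18 C_1 \sqrt{D}\, D(M)}{\varepsilon_n} \right)^{2D}. \tag{39} \]

To conclude (38), it is enough to show that

\[ \frac{3C_1M_n}{\varepsilon_n}\left(2D\log(21) + 2D\right) + 2D\log\left( \frac{18 C_1 \sqrt{D}\, D(M)}{3C_1M_n} \right) \le n\varepsilon_n^2. \tag{40} \]

Indeed, the left-hand side of (40) bounds the logarithm of the right-hand side of (39), because $\log(3C_1M_n/\varepsilon_n) \le 3C_1M_n/\varepsilon_n$. We verify (40) for $n$ sufficiently large. Since $M_n \to \infty$, the second term of the LHS of (40) will be less than zero for large $n$. On the other hand, it follows from (37) that the first term of the LHS of (40) is less than or equal to $n\varepsilon_n^2$.

□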
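As a quick numerical sanity check of the last step (a sketch, not part of the proof): reading (37) as $M_n \le n\varepsilon_n^3/(6C_1D(\log 21 + 1))$ and the first term of (40) as $(3C_1M_n/\varepsilon_n)(2D\log 21 + 2D)$, the bound on $M_n$ is exactly what keeps that term at most $n\varepsilon_n^2$. The function names and the values of $n$, $\varepsilon_n$, $C_1$ and $D$ below are illustrative assumptions.

```python
import math

def max_admissible_M(n, eps, C1, D):
    """Right-hand side of (37): the largest M_n allowed by the lemma."""
    return n * eps**3 / (6.0 * C1 * D * (math.log(21) + 1.0))

def first_term_of_40(M, eps, C1, D):
    """First term on the left-hand side of (40)."""
    return (3.0 * C1 * M / eps) * (2.0 * D * math.log(21) + 2.0 * D)

# Hypothetical values chosen only for illustration.
n, eps, C1, D = 10_000, 0.1, 2.0, 3
M_max = max_admissible_M(n, eps, C1, D)
bound = n * eps**2                                   # right-hand side of (40)
at_max = first_term_of_40(M_max, eps, C1, D)         # equals the bound
below_max = first_term_of_40(0.5 * M_max, eps, C1, D)  # strictly smaller
```

At the largest admissible $M_n$ the first term meets the bound with equality, so any smaller $M_n$ leaves room for the (eventually negative) second term.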
2.3 Verification of Inequality 2.3 of [14]

Recall that the prior $\Pi_n$, with support on $PGF(b_n) \subset \mathcal{P}_\alpha$, is given by the discretized Brownian motion at times $b_n, 2b_n, \ldots, 1$. More specifically, we define the prior $\Pi_n$ on $PGF(b_n)$ by fixing the joint distribution of $f(kb_n)$ for $0 \le k \le 1/b_n$, whose density is given by

\[ \pi_n(f) = s(f(0)) \prod_{k=0}^{1/b_n} p_{b_n}(f(kb_n), f(kb_n + b_n)), \tag{41} \]

where $s$ is a fixed density function with support on $M$ for $f(0)$, and $p_{b_n}(x,y)$ is the transition probability from $x$ to $y$ of the Brownian motion at time $b_n$.

In this section, we show that if the sequence $b_n$ is properly chosen, then $\Pi_n$ satisfies the inequality 2.3 of [14, Theorem 2.1], that is,

\[ \Pi_n(\mathcal{P}_\alpha \setminus \mathcal{P}_{n,\alpha}) \le \exp[-n\varepsilon_n^2(C + 4)]. \tag{42} \]

We first establish Lemma 2.8 below and then use it to conclude (42) in Lemma 2.9 below (under a condition on $b_n$). We use the following set

\[ X := \{ f \in PGF(b_n) \mid \mathrm{dist}_M(f(k_1b_n), f(k_2b_n)) \le M_n(k_2b_n - k_1b_n)^\alpha/3,\ \forall\, 0 \le k_1 < k_2 \le 1/b_n \}. \]

Lemma 2.8. The set $X$ is contained in $PGF(b_n) \cap \mathcal{P}_{n,\alpha}$.

Proof. By definition of $\mathcal{P}_{n,\alpha}$, it is enough to show that if $f \in X$, then

\[ \mathrm{dist}_M(f(t_1), f(t_2)) \le M_n|t_2 - t_1|^\alpha, \quad \forall\, 0 \le t_1 < t_2 \le 1. \]


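Suppose first that $t_1, t_2 \in [kb_n, (k+1)b_n]$ for some $k$. Since $f$ is geodesic on this interval and $f \in X$,

\[ \begin{aligned} \mathrm{dist}_M(f(t_1), f(t_2)) &= \frac{|t_2 - t_1|}{b_n}\, \mathrm{dist}_M(f(kb_n), f((k+1)b_n)) \\ &\le \left(\frac{|t_2 - t_1|}{b_n}\right)^{\alpha} \mathrm{dist}_M(f(kb_n), f((k+1)b_n)) \\ &\le \frac{M_n}{3}|t_2 - t_1|^\alpha. \end{aligned} \]

Now, let $t_1 \in [k_1b_n, (k_1+1)b_n]$ and $t_2 \in [k_2b_n, (k_2+1)b_n]$ for $k_1 < k_2$. By the triangle inequality,

\[ \begin{aligned} \mathrm{dist}_M(f(t_1), f(t_2)) &\le \frac{M_n}{3}|k_1b_n + b_n - t_1|^\alpha + \frac{M_n}{3}|(k_2 - k_1 - 1)b_n|^\alpha + \frac{M_n}{3}|t_2 - k_2b_n|^\alpha \\ &\le M_n|t_2 - t_1|^\alpha. \end{aligned} \]

This completes the proof.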


□ 
Next we consider the upper bound of the probability $\Pi_n(\mathcal{P}_\alpha \setminus \mathcal{P}_{n,\alpha})$. It uses a constant $C_2$ which is presented in Theorem 5.3.4 in [18, page 141]. It also introduces a constraint on $M_n$ and $\varepsilon_n$ (see (44)).

Lemma 2.9. If $\frac12 < \alpha < 1$, $b_n = M_n^{-c}$ for a constant $c$ s.t. $0 < c < 1/\alpha$ and

\[ C_2 \mathrm{Vol}(M)\, M_n^{c(2D+3)/2} \exp\left[-M_n^{2-(2\alpha-1)c}/18\right] \le \exp[-n\varepsilon_n^2(C + 4)], \tag{44} \]

then (42) is satisfied.

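Proof. We define

\[ X_{k_1,k_2} := \{ f \in PGF(b_n) \mid \mathrm{dist}_M(f(k_1b_n), f(k_2b_n)) > M_n(k_2b_n - k_1b_n)^\alpha/3 \}. \]

When $\alpha > \frac12$, Theorem 5.3.4 in [18, page 141] implies that for the constant $C_2 > 0$

\[ \Pi_n(X_{k_1,k_2}) \le \frac{C_2}{b_n^{(2D-1)/2}} \exp\left[-b_n^{2\alpha}(M_n/3)^2/(2b_n)\right] \mathrm{Vol}(M). \tag{45} \]

Consequently,

\[ \begin{aligned} \Pi_n(\mathcal{P}_\alpha \setminus \mathcal{P}_{n,\alpha}) &\le \Pi_n\big(PGF(b_n) \setminus (PGF(b_n) \cap \mathcal{P}_{n,\alpha})\big) \\ &\le \Pi_n(PGF(b_n) \setminus X) \le \sum_{0 \le k_1 < k_2 \le 1/b_n} \Pi_n(X_{k_1,k_2}) \\ &\le \frac{1}{b_n^2} \cdot \frac{C_2}{b_n^{(2D-1)/2}} \exp\left[-b_n^{2\alpha}(M_n/3)^2/(2b_n)\right] \mathrm{Vol}(M) \\ &= \frac{C_2 \mathrm{Vol}(M)}{b_n^{(2D+3)/2}} \exp\left[-b_n^{2\alpha-1} M_n^2/18\right]. \end{aligned} \tag{46} \]

The first inequality of (46) follows from the fact that the support of $\Pi_n$ is $PGF(b_n)$. The second inequality of (46) follows from Lemma 2.8. The third inequality follows from the definitions of $X$ and $X_{k_1,k_2}$. The fourth inequality of (46) follows from (45). The proof concludes by plugging $b_n = M_n^{-c}$ into (46), applying (44), and using the fact

\[ \Pi_n(\mathcal{D}_\alpha \setminus \mathcal{D}_{n,\alpha}) = \Pi_n(\mathcal{P}_\alpha \setminus \mathcal{P}_{n,\alpha}). \]

□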
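The substitution $b_n = M_n^{-c}$ that turns the last line of (46) into the left-hand side of (44) can be checked numerically. The sketch below works with logarithms to avoid underflow. `log_bound_46` and `log_lhs_44` are hypothetical helper names, and the constants are arbitrary test values, not values from the paper.

```python
import math

def log_bound_46(b, M, C2, vol, D, alpha):
    """log of C2*Vol(M) * b^{-(2D+3)/2} * exp(-b^{2*alpha-1} * M^2 / 18)."""
    return (math.log(C2 * vol) - 0.5 * (2 * D + 3) * math.log(b)
            - b ** (2 * alpha - 1) * M**2 / 18.0)

def log_lhs_44(M, c, C2, vol, D, alpha):
    """log of C2*Vol(M) * M^{c(2D+3)/2} * exp(-M^{2-(2*alpha-1)c} / 18)."""
    return (math.log(C2 * vol) + 0.5 * c * (2 * D + 3) * math.log(M)
            - M ** (2 - (2 * alpha - 1) * c) / 18.0)

# Arbitrary illustrative values with 1/2 < alpha < 1 and 0 < c < 1/alpha.
M, c, alpha, D, C2, vol = 50.0, 1.2, 0.6, 2, 3.0, 4.0 * math.pi
b = M ** (-c)                                   # the choice b_n = M_n^{-c}
from_46 = log_bound_46(b, M, C2, vol, D, alpha)
from_44 = log_lhs_44(M, c, C2, vol, D, alpha)
larger_M = log_lhs_44(10.0 * M, c, C2, vol, D, alpha)
```

For these test values the bound also decreases when $M_n$ grows tenfold, reflecting that the exponential factor dominates the polynomial one.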


2.4 Verification of Inequality 2.4 of [14] 

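We recall that inequality 2.4 of [14, Theorem 2.1] states that

\[ \Pi_n\left( P_0\left(\log\frac{p_0}{p}\right) \le \varepsilon_n^2,\ P_0\left(\log\frac{p_0}{p}\right)^2 \le \varepsilon_n^2 \right) \ge \exp[-n\varepsilon_n^2 C]. \tag{47} \]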
We first establish two technical lemmas (Lemmas 2.10 and 2.11) and then prove (47) in Lemma 2.12. The formulation of Lemma 2.10 requires the following notation. We recall that by choosing a density $\rho(t)$ on the predictor $t$, there is a map $\Phi : \mathcal{P} \to \mathcal{D}$. For simplicity, we use the following notation:

\[ p(t,x) = \Phi(f) = p_{\sigma^2}(f(t),x)\rho(t), \qquad p_0(t,x) = \Phi(f_0) = p_{\sigma^2}(f_0(t),x)\rho(t), \]

where $f$ is any continuous function and $f_0$ is the true function. Let $P_0$ be the probability with density $p_0(t,x)$ and let $P_0 f$ denote $\int f \, dP_0$. Here the density $\rho(t)$ of the predictor $t$ is assumed to be positive on $[0,1]$, so that both $p(t,x)$ and $p_0(t,x)$ are positive (their exact forms are irrelevant). We consider first the upper bounds of $P_0\left(\log\frac{p_0}{p}\right)$ and $P_0\left(\log\frac{p_0}{p}\right)^2$.

Lemma 2.10. There exists a constant $C_3 > 0$ such that

\[ P_0\left(\log\frac{p_0}{p}\right) \le C_3 d_\infty(f_0,f), \qquad P_0\left(\log\frac{p_0}{p}\right)^2 \le C_3 d_\infty(f_0,f). \tag{48} \]


Proof. Theorem 4.1.1 in [18, page 102] states that $p_{\sigma^2}(x,y)$ is strictly positive on $M \times M$. Since $M \times M$ is compact, there exist two constants $c_1, c_2 > 0$ such that

\[ c_1 \le p_{\sigma^2}(x,y) \le c_2 \tag{49} \]

for all $(x,y) \in M \times M$. Moreover, for the same reason, $p_{\sigma^2}(x,y)$ is uniformly Lipschitz continuous, that is, there exists a constant $C_3 > 0$ such that

\[ |p_{\sigma^2}(x_1,x) - p_{\sigma^2}(x_2,x)| \le C_3\, \mathrm{dist}_M(x_1,x_2) \quad \forall\, x_1, x_2, x \in M. \tag{50} \]

Then, the inequality $\log(x) \le x - 1$, (49) and (50) imply that

Po 


log^) = JJ log Podp{x)dt < JJ {po-p)^dp{x)dt (51) 


< 


Similarly, 

Po (log^ ) = 


< 


distMifit), fo{t))pit)dp{x)dt < '^Vo\{M)dao{fo, f)- 
Cl Cl 


log(^ 
Po 




Podfj,{y)dt 

[[ log f ) podp{y)dt+[[ log 
J Jp>po L \Po / } J Jp<po L \ 

Podp{y)dt + jj 


P-Po 


^P<P 0 


Po-P 


Podp{y)dt 


Podti{y)dt 


< -id^ifoj). 


Consequently, (48) is satisfied with C 3 


max(^-^Vol(M), Cg/cj). 


□ 
Lemma 2.11. Assume that $C_3$ is an arbitrarily chosen positive constant. If $f_0$ is a Lipschitz continuous function with the Lipschitz constant $L > 0$ and $f \in PGF(b_n)$ such that $f(kb_n)$ is in the $r_n$-ball $B(f_0(kb_n), r_n)$ on $M$, where

\[ r_n = \frac{\varepsilon_n^2}{3C_3} - \frac{2Lb_n}{3}, \]

then $d_\infty(f_0,f) \le \varepsilon_n^2/C_3$.

Proof. Since $f_0$ is Lipschitz,

\[ \mathrm{dist}_M(f_0(kb_n), f_0(kb_n + t)) \le Lb_n \quad \forall\, 0 \le k < 1/b_n,\ 0 \le t \le b_n. \tag{52} \]

Since $f$ is geodesic on each interval $[kb_n, kb_n + b_n]$,

\[ \mathrm{dist}_M(f(kb_n), f(t)) \le \mathrm{dist}_M(f(kb_n), f(kb_n + b_n)) \quad \text{for } t \in [kb_n, kb_n + b_n]. \tag{53} \]

By $\mathrm{dist}_M(f_0(kb_n), f(kb_n)) \le r_n$, (52), (53) and the triangle inequality,

\[ \mathrm{dist}_M(f(kb_n), f(t)) \le 2r_n + Lb_n. \tag{54} \]

Similarly,

\[ \mathrm{dist}_M(f(kb_n), f_0(t)) \le r_n + Lb_n. \tag{55} \]

Inequalities (54) and (55) imply that

\[ \mathrm{dist}_M(f_0(t), f(t)) \le 3r_n + 2Lb_n = \varepsilon_n^2/C_3. \tag{56} \]

The proof is concluded by the fact that (56) is true for every $t$.

□
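Lemma 2.11 can be illustrated on the circle $S^1$ (the manifold used in Section 5), where geodesic interpolation is explicit. In the sketch below, angles are in radians, `f0` is a hypothetical Lipschitz curve with constant $L$, and the knots of the piecewise geodesic function are random points within distance $r$ of $f_0(kb)$. All names and numerical values are assumptions made for illustration. The check is that the sup distance stays below $3r + 2Lb$, the bound appearing in (56).

```python
import math
import random

TWO_PI = 2.0 * math.pi

def circ_dist(a, b):
    """Geodesic distance between two angles on S^1."""
    d = abs(a - b) % TWO_PI
    return min(d, TWO_PI - d)

def shortest_turn(a, b):
    """Signed shortest rotation taking angle a to angle b, in (-pi, pi]."""
    d = (b - a) % TWO_PI
    return d - TWO_PI if d > math.pi else d

def piecewise_geodesic(knots, t):
    """Evaluate at t in [0, 1] the function that is geodesic between
    consecutive knots placed at times 0, b, 2b, ..., 1 with b = 1/(len(knots)-1)."""
    m = len(knots) - 1
    k = min(int(t * m), m - 1)          # index of the interval containing t
    lam = t * m - k                     # relative position inside that interval
    return (knots[k] + lam * shortest_turn(knots[k], knots[k + 1])) % TWO_PI

L, m, r = 3.0, 20, 0.05                 # Lipschitz constant, 1/b, knot radius
b = 1.0 / m

def f0(t):
    """Hypothetical Lipschitz 'true' curve on the circle."""
    return (L * t) % TWO_PI

rng = random.Random(0)
knots = [(f0(k * b) + rng.uniform(-r, r)) % TWO_PI for k in range(m + 1)]

grid = [i / 1000.0 for i in range(1001)]
sup_dist = max(circ_dist(f0(t), piecewise_geodesic(knots, t)) for t in grid)
bound = 3.0 * r + 2.0 * L * b           # the quantity 3 r_n + 2 L b_n in (56)
```

On this particular example the observed sup distance is much smaller than the bound, which is expected: (56) is a worst-case estimate obtained from three applications of the triangle inequality.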

Lemma 2.12. If $f_0$ is a Lipschitz continuous function, then there exists a sufficiently large constant $C_0 > 0$ such that if $b_n = \varepsilon_n^2/C_0$ and $n\varepsilon_n^{4+\delta} \to \infty$ ($\delta > 0$), then the sequence of priors $\Pi_n$ satisfies (47) for all $n > N_0$ ($N_0$ depends on $\delta$).

Proof. By Lemma 2.10, it is enough to show that

\[ \Pi_n(f : C_3 d_\infty(f_0,f) \le \varepsilon_n^2) \ge \exp[-n\varepsilon_n^2 C], \tag{57} \]

where $C_3$ is the constant in Lemma 2.10. Let $f \in PGF(b_n)$ and $L$ be the Lipschitz constant of $f_0$. It follows from Lemma 2.11 that if

\[ \mathrm{dist}_M(f_0(kb_n), f(kb_n)) \le r_n = \frac{\varepsilon_n^2}{3C_3} - \frac{2Lb_n}{3} \quad \forall\, 0 \le k \le 1/b_n, \tag{58} \]

then $d_\infty(f_0,f) \le \varepsilon_n^2/C_3$.

Moreover, we note that (52), (58) and the triangle inequality imply that

\[ \mathrm{dist}_M(f(kb_n), f(kb_n + b_n)) \le Lb_n + 2r_n. \tag{59} \]
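It follows from Theorem 5.3.4 in [18, page 141] and (59) that for a constant $C_4 > 0$,

\[ p_{b_n}(f(kb_n), f(kb_n + b_n)) \ge \frac{C_4}{b_n^{D/2}} \exp\left[-\frac{\mathrm{dist}_M(f(kb_n), f(kb_n + b_n))^2}{2b_n}\right] \ge \frac{C_4}{b_n^{D/2}} \exp\left[-\frac{(Lb_n + 2r_n)^2}{2b_n}\right]. \]

Recall that the support of $\Pi_n$ is $PGF(b_n)$. Therefore,

\[ \Pi_n(f : C_3 d_\infty(f_0,f) \le \varepsilon_n^2) \ge \frac{\mathrm{Vol}(B(f_0(0), r_n))}{\mathrm{Vol}(M)} \prod_{k=1}^{1/b_n} \frac{C_4}{b_n^{D/2}} \exp\left[-\frac{(Lb_n + 2r_n)^2}{2b_n}\right] \mathrm{Vol}(B(f_0(kb_n), r_n)). \tag{60} \]

Since

\[ \mathrm{Vol}(B(f_0(kb_n), r_n)) \ge C_5 r_n^D \quad \text{for a constant } C_5 > 0, \]

the RHS of (60) is at least

\[ \frac{1}{\mathrm{Vol}(M)} \left(C_5 r_n^D\right)^{1/b_n + 1} \left(\frac{C_4}{b_n^{D/2}}\right)^{1/b_n} \exp\left[-\frac{(Lb_n + 2r_n)^2}{2b_n^2}\right]. \tag{61} \]

Plugging the expression of $r_n$ in (58) and $C_0 b_n = \varepsilon_n^2$ for a constant $C_0 > 0$, the logarithm of (61) being greater or equal to $-n\varepsilon_n^2 C$ is simplified as

\[ \frac{1}{b_n}\left[ -\log\left(\frac{C_4 C_5^{1+b_n}}{\mathrm{Vol}(M)^{b_n}}\right) - D(1 + b_n)\log\left(\frac{C_0}{3C_3} - \frac{2L}{3}\right) - \left(\frac{D}{2} + Db_n\right)\log(b_n) + \frac12\left(\frac{2C_0}{3C_3} - \frac{L}{3}\right)^2 \right] \le n\varepsilon_n^2 C. \tag{62} \]

We fix a constant $C_0 > 0$ large enough so that for all $b_n$,

\[ \frac{C_4 C_5^{1+b_n}}{\mathrm{Vol}(M)^{b_n}} \left(\frac{C_0}{3C_3} - \frac{2L}{3}\right)^{D(1+b_n)} \ge 1. \]

The constant $C_0$ exists since $b_n \to 0$. Moreover, we note that since the fourth term inside the brackets of (62) is a constant, to satisfy (62), it is enough to show that

\[ -\frac{1}{b_n}\left(\frac{D}{2} + Db_n\right)\log(b_n) \le \frac{C}{2} n\varepsilon_n^2. \tag{63} \]

Substituting $C_0 b_n = \varepsilon_n^2$ in (63) yields, for any $K > 0$, the inequality

\[ \frac{C_0}{\varepsilon_n^2}\left(\frac{D}{2} + \frac{D\varepsilon_n^2}{C_0}\right) K \log\left[\left(\frac{C_0}{\varepsilon_n^2}\right)^{1/K}\right] \le \frac{C}{2} n\varepsilon_n^2. \tag{64} \]

We note that by using $\log(x) \le x$, it is enough to show that

\[ K\left(\frac{D}{2} + \frac{D\varepsilon_n^2}{C_0}\right) C_0^{1+1/K} \le \frac{C}{2} n\varepsilon_n^{4+2/K}. \tag{65} \]

If we pick any $K > 0$ such that $\frac{2}{K} < \delta$, then the right-hand side of (65) approaches infinity while the left-hand side is bounded. This implies that there exists a constant $N_0 > 0$ such that for all $n > N_0$, (65) is satisfied, which guarantees that (57) and thus the lemma are true.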

□ 


2.5 Conclusion of Theorem 1.1 

Under the assumptions that $0 < c < \frac{1}{\alpha}$ and $f_0$ is Lipschitz, we showed, in previous sections, that if we pick $b_n$, $M_n$, $\varepsilon_n$ such that

\[ b_n = M_n^{-c}, \quad b_n = C_0\varepsilon_n^2, \quad n\varepsilon_n^{4+\delta} \to \infty \quad \text{and (37) \& (44) hold}, \tag{66} \]

then Theorem 1.1 follows directly from [14, Theorem 2.1]. In this section, we conclude the proof by solving the inequalities for parameters and showing the optimal choice of the sequence $\varepsilon_n$ (which determines the contraction rate).

The first two equalities of (66) imply that

\[ M_n = (C_0\varepsilon_n^2)^{-1/c}. \tag{67} \]

Plugging (67) into (37) and simplifying the expression yields

\[ 6 C_0^{-1/c} C_1 D(\log(21) + 1) \le n\varepsilon_n^{3+2/c}. \tag{68} \]

Plugging (67) into (44) and taking the logarithm of both sides (with simplification) results in the inequality

\[ \left(-\log\left(C_0^{-(2D+3)/2} C_2 \mathrm{Vol}(M)\right) + (2D + 3)\log(\varepsilon_n)\right)\varepsilon_n^{4/c - 4\alpha + 2} + \frac{1}{18} C_0^{-2/c + 2\alpha - 1} \ge n\varepsilon_n^{4/c - 4\alpha + 4}(C + 4). \tag{69} \]

We note that the first term of (69) approaches zero when $4/c - 4\alpha + 2 > 0$. Therefore, to satisfy (69), we only need that the second term, which is a constant, is no less than the right-hand side. That is,

\[ \frac{1}{18} C_0^{-2/c + 2\alpha - 1} \ge n\varepsilon_n^{4/c - 4\alpha + 4}(C + 4). \tag{70} \]

If we pick $\alpha$, $c$ and $\varepsilon_n$ so that the right-hand side of (70) approaches zero, then (69) is satisfied for large $n$. It follows from (68), (70) and the fact that $n\varepsilon_n^{4+\delta} \to \infty$ that the constants $\alpha$ and $c$ need to satisfy

\[ 3 + \frac{2}{c} < 4 + \delta < \frac{4}{c} - 4\alpha + 4. \]
One choice is c = and α = . Under this choice, the sequence $\varepsilon_n = n^{-1/(4+3\delta/2)}$ satisfies (68) and (70). Since $\delta > 0$ can be arbitrarily small, the best achievable contraction rate is

\[ \varepsilon_n = n^{-1/4+\epsilon} \quad \text{for any fixed } \epsilon > 0. \]
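The role of $\delta$ can be made concrete with a short computation. Taking the sequence $\varepsilon_n = n^{-1/(4+3\delta/2)}$ from the paragraph above, $n\varepsilon_n^{4+\delta} = n^{(\delta/2)/(4+3\delta/2)}$ grows without bound for every $\delta > 0$, while the exponent $1/(4+3\delta/2)$ approaches $1/4$ as $\delta \to 0$. The function names below are hypothetical.

```python
def rate_exponent(delta):
    """Exponent a such that eps_n = n**(-a), with a = 1 / (4 + 3*delta/2)."""
    return 1.0 / (4.0 + 1.5 * delta)

def growth_exponent(delta):
    """Exponent g such that n * eps_n**(4 + delta) = n**g."""
    return 1.0 - (4.0 + delta) * rate_exponent(delta)

deltas = (1.0, 0.1, 0.01)
exponents = {d: rate_exponent(d) for d in deltas}   # approaches 1/4 from below
growth = {d: growth_exponent(d) for d in deltas}    # positive for every delta > 0
```

The trade-off is visible in the two dictionaries: a smaller $\delta$ brings the rate closer to $n^{-1/4}$ but makes $n\varepsilon_n^{4+\delta}$ diverge more slowly.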

3 Proof of Theorem 1.2 

We first prove a technical lemma (Lemma 3.1) which requires some definitions and then conclude the proof of Theorem 1.2. Let $Q^T_{x_0,\ldots,x_k}$ be the Brownian bridge probability measure on the path space $\{f \in C([0,T],M) : f(0) = x_0, f(iT) = x_i, \forall\, 0 \le i \le k\}$. In particular, we denote by $Q^T_{x,y}$ the Brownian bridge probability measure on the path space $U = \{f \in C([0,T],M) : f(0) = x, f(T) = y\}$.

Lemma 3.1. If $x, y \in M$ s.t. $\mathrm{dist}_M(x,y) < \varepsilon_0/2$, then there exists $T_0 > 0$ such that

\[ Q^T_{x,y}(\mathrm{dist}_M(f,x) \ge \varepsilon_0) < 1, \quad \forall\, T \le T_0, \]

where $\mathrm{dist}_M(f,x) = \max_{t \in [0,T]} \mathrm{dist}_M(f(t),x)$. In other words, the Brownian bridge assumes positive measure over the subset of paths $\{f \in U : f([0,T]) \subset B(x,\varepsilon_0)\}$.

Proof. Equation 2.6 in [19] implies that there exists $T_0 > 0$ such that if $T \le T_0$, then

\[ T\log\left(Q^T_{x,y}(\mathrm{dist}_M(f,x) \ge \varepsilon_0)\right) \le -\varepsilon_0^2 + 4\,\mathrm{dist}_M(x,y)^2. \tag{71} \]

That is,

\[ Q^T_{x,y}(\mathrm{dist}_M(f,x) \ge \varepsilon_0) \le \exp\left[(-\varepsilon_0^2 + 4\,\mathrm{dist}_M(x,y)^2)/T\right] < 1. \tag{72} \]

The first inequality in (72) follows from (71) and the second inequality follows from the assumption that $\mathrm{dist}_M(x,y) < \varepsilon_0/2$.

□
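To see why the strict inequality $\mathrm{dist}_M(x,y) < \varepsilon_0/2$ matters in (72), one can evaluate the bound $\exp[(-\varepsilon_0^2 + 4\,\mathrm{dist}_M(x,y)^2)/T]$ directly. The helper name and the sample values below are illustrative assumptions.

```python
import math

def bridge_exit_bound(eps0, dist_xy, T):
    """Upper bound in (72) on the probability that the bridge leaves B(x, eps0)."""
    return math.exp((-eps0**2 + 4.0 * dist_xy**2) / T)

inside = bridge_exit_bound(eps0=1.0, dist_xy=0.4, T=0.1)      # dist < eps0 / 2
boundary = bridge_exit_bound(eps0=1.0, dist_xy=0.5, T=0.1)    # dist = eps0 / 2
smaller_T = bridge_exit_bound(eps0=1.0, dist_xy=0.4, T=0.01)  # shorter bridge
```

When the endpoints are exactly $\varepsilon_0/2$ apart the bound equals 1 and carries no information, whereas for closer endpoints it is below 1 and shrinks further as $T$ decreases.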

We now conclude the proof of Theorem 1.2. Recall that the Kullback-Leibler (KL) divergence between $p_{f_0}$ and $p_f$ is defined as

\[ d_{KL}(p_{f_0}, p_f) = \iint_{[0,1] \times M} p_{f_0} \log\left(\frac{p_{f_0}}{p_f}\right) dt\, d\mu(x). \]

A corollary of Theorem 6.1 in [29] implies that if $\Pi$ assumes positive mass on any Kullback-Leibler neighborhood of $p_{f_0}$, then the posterior distribution is weakly consistent. Thus, it is enough to show that

\[ \Pi(\{f : d_{KL}(p_{f_0}, p_f) < \epsilon\}) > 0, \quad \forall\, \epsilon > 0. \]

We note that Lemma 2.10 shows that $d_{KL}(p_{f_0}, p_f)$ is upper bounded by a constant multiple of $d_\infty(f_0,f)$. Therefore, we only need to prove that

\[ \Pi(\{f : d_\infty(f_0,f) < \epsilon\}) > 0, \quad \forall\, \epsilon > 0. \]
Fix a positive number $\epsilon_1 < \epsilon$. We consider a regular (e.g., equidistant) grid of $[0,1]$ with spacing $T$. We assume the regular grid satisfies the following conditions:

1. $B(x_i, \epsilon_1) \subset B(x, \epsilon)$, $\forall\, x \in f_0([iT, (i+1)T])$ and $x_i = f_0(iT)$,

2. $\mathrm{dist}_M(x_i, x_{i+1}) < \epsilon_1/4$.

The Lipschitz assumption of $f_0$ guarantees the existence of $T$. Indeed, Condition (1) is guaranteed by the triangle inequality of the metric $\mathrm{dist}_M$ and the Lipschitz assumption, and Condition (2) is guaranteed by picking a sufficiently small $T$. Given a positive number $\delta < \epsilon_1/24$, the triangle inequality implies that

\[ B(\tilde x_i, 2\epsilon_1/3) \subset B(x_i, \epsilon_1) \quad \text{and} \quad \mathrm{dist}_M(\tilde x_i, \tilde x_{i+1}) < \epsilon_1/3, \quad \forall\, \tilde x_i \in B(x_i, \delta). \tag{73} \]

Applying Lemma 3.1 to $\tilde x_i$ and $\tilde x_{i+1}$ implies that $Q^T_{\tilde x_0,\ldots,\tilde x_{1/T}}$ assumes positive measure over the set of paths

\[ V_1 = \{f : f(0) = \tilde x_0,\ f(iT) = \tilde x_i,\ f([iT, (i+1)T]) \subset B(\tilde x_i, 2\epsilon_1/3),\ \forall\, i \in [0, 1/T]\}. \]

If $f \in V_1$, then for any $t \in [iT, (i+1)T]$,

\[ f(t) \in B(\tilde x_i, 2\epsilon_1/3) \subset B(x_i, \epsilon_1) \subset B(f_0(t), \epsilon). \tag{74} \]

The first inclusion in (74) follows from (73) and the second inclusion in (74) follows from condition (1) of the regular grid. By definition, (74) implies that $V_1 \subset \{f : d_\infty(f_0,f) < \epsilon\}$. Therefore,

\[ \Pi(\{f : d_\infty(f_0,f) < \epsilon\}) \ge \int_{\prod_i B(x_i,\delta)} Q^T_{\tilde x_0,\ldots,\tilde x_{1/T}}(V_1)\, \Pi_n(d\tilde x) > 0, \]

where $\Pi_n$ is the probability measure of the discretized Brownian motion with spacing $b_n = T$ and $\tilde x = (\tilde x_1, \ldots, \tilde x_{1/T})^T$.


4 Extensions of The Regression Framework 

In this section, we briefly discuss two extensions of the current framework, where Theorems 1.1 and 1.2 equally apply. In Section 4.1, we consider the case where the variance $\sigma^2$ is unknown. Section 4.2 explains how to possibly relax the assumption that $\rho(t)$ has a positive lower bound.

4.1 The Case of Unknown Variance

The mapping $\Phi$ of (7) assumes that $\sigma^2$ is a fixed and known parameter. If it is unknown, the prior on it can be chosen as the uniform distribution on the interval $[1/A, A]$ for some constant $A > 0$ (or any other distribution, as long as $\sigma^2$ stays bounded away from zero and infinity).
Under this prior of $\sigma^2$, the probability density of $(t,x)$ is given by

\[ p_f(t,x) = \int p_{\sigma^2}(f(t),x)\rho(t)\, d\sigma^2. \]

Since $p_{\sigma^2}(x,y)$ and its partial derivatives (w.r.t. $x$ and $y$) are uniformly continuous in the variable $\sigma^2$ over the interval $[1/A, A]$, it is easy to see that Lemmas 2.1 and 2.10 still hold for this type of probability densities. Therefore, the contraction rate for the case of unknown variance is the same as the case of fixed variance.

4.2 More General $\rho(t)$

Throughout the paper, we assume that the distribution of the predictor $t$ has a smooth density $\rho(t)$ on $[0,1]$ with strict lower and upper bounds $0 < m_\rho < M_\rho$. This assumption is used in Lemma 2.1. Since $\rho(t)$ is continuous, the upper bound $M_\rho$ always exists, but the lower bound can be restrictive. We can relax the lower bound on $\rho(t)$ as follows. Let $r > 0$ and $S_r = \{t \in [0,1] \mid \rho(t) \ge r\}$. By following the same arguments in the proof, we note that the posterior distribution contracts at the same rate to the true function when considering the $L_q$ norm of functions restricted to $S_r$.


5 Numerical Demonstrations 

In this section, we demonstrate the proposed Bayesian scheme and compare it with a kernel method for the simple manifold $S^1$. We also investigate the effect of changing various parameters for this special case.

One reason for using $S^1$ is its simplicity of visualization. Indeed, $S^1$ can be identified with the interval $[0, 2\pi]$ and this makes it easy to plot the $S^1$-valued functions. The other reason is that $S^1$, as a Lie group, has the addition operator on it. Thus, the kernel method in Euclidean spaces directly applies to this situation, with special awareness of the issue of averaging (more specifically, the average of the points $0$ and $2\pi$ on $S^1$ is $0$, not $\pi$).
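The averaging issue can be made concrete with a minimal sketch. It uses the mean direction (the argument of the summed unit vectors), which is one common way to average angles. The text does not specify which averaging rule the experiments use, so this choice and the function names are assumptions.

```python
import math

TWO_PI = 2.0 * math.pi

def circular_mean(angles):
    """Mean direction of angles in radians: argument of the summed unit vectors."""
    s = sum(math.sin(a) for a in angles)
    c = sum(math.cos(a) for a in angles)
    return math.atan2(s, c) % TWO_PI

def circ_dist(a, b):
    """Geodesic distance between two angles on S^1."""
    d = abs(a - b) % TWO_PI
    return min(d, TWO_PI - d)

naive = (0.0 + TWO_PI) / 2.0                # arithmetic mean of 0 and 2*pi
intrinsic = circular_mean([0.0, TWO_PI])    # mean direction of the same points
```

The arithmetic mean lands on $\pi$, the point of the circle farthest from both observations, while the mean direction coincides with $0$ up to rounding.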

For the discretized and continuous BM Bayesian schemes, we obtain the maximum a posteriori (MAP) probability estimators by implementing a simulated annealing (SA) algorithm on the corresponding posterior distributions. The starting state (function) of SA is defined as follows: the value $f(t)$ at time $t$ is the mode of all observed values whose observation times are in $[t - 0.05, t + 0.05]$. For the discretized BM Bayesian scheme, the sidelength parameter $b_n$ is fixed to be $1/40$. For the kernel method, we use the Matlab code [6] implemented according to the Nadaraya-Watson kernel regression with the optimal bandwidth suggested by Bowman and Azzalini [4].
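For orientation, a Nadaraya-Watson estimator adapted to $S^1$-valued responses can be sketched as a kernel-weighted mean direction. This is not the Matlab code [6] used in the experiments: the Gaussian kernel, the fixed bandwidth `h`, and all names below are assumptions, and no optimal-bandwidth rule is implemented.

```python
import math

TWO_PI = 2.0 * math.pi

def nw_circle(t_query, t_obs, y_obs, h):
    """Kernel-weighted mean direction of the angles y_obs at predictor t_query."""
    weights = [math.exp(-0.5 * ((t_query - t) / h) ** 2) for t in t_obs]
    s = sum(w * math.sin(y) for w, y in zip(weights, y_obs))
    c = sum(w * math.cos(y) for w, y in zip(weights, y_obs))
    return math.atan2(s, c) % TWO_PI

# Toy data: the responses straddle the cut of the circle at 0 = 2*pi.
t_obs = [0.1, 0.2, 0.3, 0.4]
y_obs = [0.1, TWO_PI - 0.1, 0.1, TWO_PI - 0.1]
estimate = nw_circle(0.25, t_obs, y_obs, h=0.2)
naive = sum(y_obs) / len(y_obs)     # unweighted arithmetic mean of the angles
```

On this toy data the arithmetic mean of the angles is $\pi$, whereas the weighted mean direction stays next to $0$, where the observations actually lie.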

We remark that we use Brownian motion of various scales and not the standard one, $BM_t$, assumed in the proof. Nevertheless, the convergence result clearly holds for any scaled Brownian motion $BM_{ct}$, where $c > 0$. In fact, $c$ is an additional hyperparameter (see Section 5.2).
5.1 Comparison with kernel regression

In the first experiment, we compare three estimators, namely, the discretized BM MAP (DBM) estimator, the continuous BM MAP (CBM) estimator and the kernel regression estimator (KER). We fix the scaling hyperparameter $c = 0.01$ for DBM and CBM and the optimal bandwidth for KER. We generate datasets of 30 observations according to the pdf defined in (2), where $\sigma^2 = 0.1$ and $f_0 : [0,1] \to S^1$ is defined by

f_0(t) := (t + 0.5)^, for t ∈ [0,1].

Figure 1 shows the original function and its different estimators according to DBM, CBM and KER. The $L_1$ errors between the estimated functions and the true function are also displayed. Among them, the CBM achieves the minimal $L_1$ error.
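A dataset of this kind could be simulated as follows. This sketch makes several assumptions that the text does not fix: the predictors are drawn uniformly on $[0,1]$, the heat kernel on $S^1$ at time $\sigma^2$ is sampled as a wrapped normal with variance $\sigma^2$ (the convention in which Brownian motion has generator $\frac12\Delta$), and the exponent in $f_0$ is set to 2 purely as a placeholder.

```python
import math
import random

TWO_PI = 2.0 * math.pi

def f0(t, power=2):
    """Placeholder true curve t -> (t + 0.5)**power on S^1 (power is an assumption)."""
    return ((t + 0.5) ** power) % TWO_PI

def simulate(n_obs, sigma2, seed=0):
    """Draw (t_i, y_i) with t_i ~ U[0, 1] and y_i wrapped-normal around f0(t_i)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_obs):
        t = rng.random()
        y = (f0(t) + rng.gauss(0.0, math.sqrt(sigma2))) % TWO_PI
        data.append((t, y))
    return data

dataset = simulate(n_obs=30, sigma2=0.1)
```

Wrapping a Gaussian perturbation onto the circle is what makes the responses genuinely $S^1$-valued: a noisy value just below $0$ reappears just below $2\pi$.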




Figure 1: Demonstration of the continuous and discretized BM Bayesian estimators with comparison to a kernel estimator. The data was generated according to the pdf $p_{f_0(t)}(x)$, where $f_0$ is demonstrated in the top left subfigure. The estimations obtained by CBM (continuous Brownian motion), DBM (discretized Brownian motion) and KER (kernel method) are shown in the rest of the subfigures together with their $L_1$ errors.


5.2 The hyperparameter c 

The hyperparameter $c$ plays a role similar to that of the regularization hyperparameter in regularized regression. The second experiment shows how the hyperparameter $c$ (with values in $\{0.01, 0.1, 1, 10\}$) affects the estimation. We fix a dataset of 40 observations with noise variance 0.05 from the same function as in the first experiment. Figures 2 and 3 demonstrate the MAP estimators obtained by CBM and DBM respectively. In both figures, the estimators become smoother when $c$ decreases. Indeed, smaller $c$ means shorter time for the BM to travel. But smaller $c$ also introduces more bias in the estimators. This is more evident for DBM in Figure 3, while CBM seems less sensitive to small values of $c$ (see Figure 2).
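The smoothing effect of $c$ can be seen from the log-density of a discretized Brownian motion prior on $S^1$. The sketch below assumes that the transition density over one step is a wrapped normal with variance $c\,b_n$ (with the wrapping sum truncated to a few terms) and drops the initial density $s$. These choices and all names are illustrative assumptions, not the implementation used in the experiments.

```python
import math

TWO_PI = 2.0 * math.pi

def log_wrapped_normal(theta, var, wraps=3):
    """Log-density at angle theta of a N(0, var) variable wrapped onto S^1."""
    exps = [-0.5 * (theta + TWO_PI * k) ** 2 / var for k in range(-wraps, wraps + 1)]
    top = max(exps)                      # log-sum-exp to avoid underflow
    return top + math.log(sum(math.exp(e - top) for e in exps)) - 0.5 * math.log(TWO_PI * var)

def log_prior(knots, b, c):
    """Sum of log transition densities between consecutive knots (time step c*b)."""
    return sum(log_wrapped_normal(y1 - y0, c * b)
               for y0, y1 in zip(knots[:-1], knots[1:]))

b = 1.0 / 40
smooth = [0.05 * k for k in range(41)]                     # small, regular increments
rough = [0.05 * k + 0.3 * (-1) ** k for k in range(41)]    # zig-zag increments
gap_small_c = log_prior(smooth, b, 0.01) - log_prior(rough, b, 0.01)
gap_large_c = log_prior(smooth, b, 10.0) - log_prior(rough, b, 10.0)
```

Under these assumptions the prior prefers the smooth knot sequence for both values of $c$, but the preference is far stronger for $c = 0.01$, which is consistent with smoother (and more biased) estimators at small $c$.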








Figure 2: The estimations obtained by CBM for different values of c (c = 
0.01,0.1,1,10), where c is the scaling parameter of the Brownian motion. 


5.3 The sidelength parameter $b_n$

For DBM we have another important parameter $b_n$, which determines the number of pieces of a piecewise geodesic function. When $b_n = 1$, the piecewise geodesic function becomes geodesic. In this experiment, we show the change of $L_1$ error of the DBM estimator for different choices of $b_n$ ($1/b_n$ ranges from 1 to 100). The data set is generated from the same model as in the first experiment. Figure 4 shows that for geodesic functions or functions with large $b_n$ there is a large $L_1$ error due to large bias. As $b_n$ becomes smaller, there is a steady decrease of the $L_1$ error due to the decrease of bias.
Figure 3: The estimations obtained by DBM for different values of $c$ ($c = 0.01, 0.1, 1, 10$), where $c$ is the scaling parameter of the Brownian motion. Unlike CBM, an underestimation (i.e., sensitivity to bias) is observed when $c = 0.01$.

Figure 4: $L_1$ error of DBM for different sidelengths $b_n$.

6 Conclusion 

We established the consistency of the Bayesian estimator with a Brownian motion prior in the manifold regression setting. For the discretized Brownian motion, we even specified a contraction rate via a well-known general approach [14, 31]. We thus propose a new nonparametric Bayesian framework with solid statistical analysis beyond the existing kernel methods and Gaussian process priors. In fact, one of our motivations for this work is that a Gaussian process prior cannot be applied to manifold responses that lack linear structure.

We also list a few interesting questions for possible future study. 

6.1 Better Quantitative Estimate of $C_0$ and $C_1$

The constants $C_0$ and $C_1$ in Lemma 2.1 (comparing the distance of functions and the distance of distributions) are not specified due to our proof by contradiction. The specification of their dependencies on the underlying Riemannian geometry is worth further investigation.

6.2 $L_\infty$ Convergence

We only proved $L_p$-convergence for the Brownian motion prior. It is interesting to investigate the $L_\infty$ convergence if it exists at all. If it does not exist, then it is interesting to know if a smoother prior (e.g., integrated BM) has $L_\infty$ convergence.

6.3 A Better Contraction Rate? 

For regression with real-valued predictors and responses, van Zanten [32] established posterior contraction rate of $n^{-1/4}$ for $n$ samples under the $L_q$-norm, where $1 \le q < \infty$. His analysis does not seem to extend to our setting. It is possible that even for the general case of manifold-valued regression the contraction rate is $n^{-1/4}$ and not just $n^{-1/4+\epsilon}$. The particular method used here does not seem to obtain a better rate.